Windows 2019 is very slow and unstable after upgrading to Proxmox V8

stevechu

New Member
Jul 10, 2023
My server is a DL390 G9 with 2x E5-2686 v4 and 256GB of DDR4 memory. This Windows VM is the only one running on the server at the moment.
This is the VM's config:

agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 40
cpu: host
machine: pc-q35-7.2
memory: 65536
meta: creation-qemu=7.2.0,ctime=1686661732
name: captcha
net0: virtio=CE:E9:F1:CB:44:AB,bridge=vmbr0
numa: 1
ostype: win10
scsi0: local-zfs:vm-100-disk-0,cache=writeback,discard=on,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=404d5c78-49fc-417a-8f4f-907e2b5f8be0
sockets: 1
vmgenid: 6800b79c-f4c9-4db4-a2a7-46103d5ff591

The symptoms are:
1. The Remote Desktop connection is lost frequently. It happens with the Proxmox web VNC console too.
2. Everything on the VM is just laggy.

The exact same VM works very smoothly on Proxmox 7.0.
 
Why on earth do you need 40 cores?
Anyway, NUMA is on but there is only one socket. Change sockets to 2 and halve the cores per socket.
It should work better.
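A rough sketch of that change from the host shell, assuming the VM keeps ID 100 from the config above (apply while the VM is powered off):

Code:
# 2 virtual sockets x 20 cores = the same 40 vCPUs, but with a NUMA-friendly layout
qm set 100 --sockets 2 --cores 20 --numa 1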
 
Also check the disk setup.
It should have cache=none and iothread=1.
iothread=1 only works with scsihw: virtio-scsi-single (as far as I know).
But I don't think this is responsible for those problems.
I would suggest cache=none on ZFS too.
 
Sorry, I missed the storage controller in the config file.
It should be changed to SCSI Single.
https://pve.proxmox.com/wiki/QEMU/KVM_Virtual_Machines

The setup here looks similar to that old doc:
https://pve.proxmox.com/wiki/Windows_10_guest_best_practices

Unfortunately, here the storage backend is ZFS, not LVM or any remote storage (Ceph/iSCSI).

So, if the storage is not well configured, IOPS are so bad that Windows translates it into lags and freezes.

My guess: @stevechu could run a CrystalDiskMark benchmark to confirm the issue.
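If you'd rather measure from the Proxmox host instead of inside Windows, something like the fio run below gives a comparable 4K random I/O picture. This is only a sketch: fio may need to be installed first, and the /rpool/data path is an assumption based on a default local-zfs install.

Code:
apt install fio
# 4K random read/write against a 1 GiB test file on the ZFS pool
fio --name=randrw4k --directory=/rpool/data --size=1G \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 \
    --ioengine=libaio --runtime=30 --time_based --group_reporting
# clean up the test file afterwards
rm /rpool/data/randrw4k.*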
 
I think my problem must be related to Linux kernel 6 or the new QEMU 8. With the latest Proxmox 7 release (kernel 5.15) it works perfectly fine.
My other system, running a Threadripper, works great with Proxmox 8.
It seems some other users are having the same problem as I am (newer kernels or newer QEMU):

https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459

@vherrlein That VM runs well on Proxmox 7, not 8. I've been running VMs on Proxmox for years and it has been amazing; this is the first time this has happened, and only after upgrading to 8.
 
Same here: before migrating to 8.0 I was on 7.4 with kernel 5.19.

The issue is clearly how QEMU handles iov memory allocation before writing to virtual disks in the new version.
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2025591

That's why, in your case, you should try changing the controller to SCSI Single, with write cache set to none, iothread enabled, and aio left at its default (a change to make while the VM is off).

Also check that the sector size of the Windows disk, the QEMU logical/physical block size, and the ZFS volume vm-100-disk-0 are aligned.
But WARNING: if the Windows partitions use a 512B sector size, you'll have to wait for the patch which is currently under review.
Changing those block-size settings can be destructive to data.
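For reference, the controller change and the alignment checks might look roughly like this (VM ID 100 and disk name taken from the config above; the rpool/data path is an assumption and may differ on your install; make the qm changes with the VM shut down):

Code:
# one virtio-scsi controller per disk, no host cache, dedicated I/O thread, default aio
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none,discard=on,iothread=1

# on the host: block size of the zvol backing the disk
zfs get volblocksize rpool/data/vm-100-disk-0

# inside the Windows guest: logical/physical sector size of the system disk
fsutil fsinfo ntfsinfo C: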
 
Thank you, I'll try this solution!
Update: it didn't work. It resulted in a lower disk benchmark score, and the VM is still frequently unresponsive with Proxmox 8.
 

Attachments

  • 7.4.2-5.15.png
  • 7.4.2-5.15-single-nocache.png
How old are your 1TB QVOs in raidz1? Have you been running raidz1 from the beginning? Can you share your wear level? TBW?
 
To test the stability of the system, I always reinstall Proxmox and create a new pool. No, it is not a QVO; it is 4x 2TB Intel S4510. Wearout is 0%. TBW was less than 100TB. The drives are rated for 7PB TBW, so there is still a long way to go.
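For anyone wanting to pull the same numbers, wearout and total writes come straight from SMART; a quick way to eyeball them (the device name is just an example):

Code:
smartctl -a /dev/sda | grep -Ei 'wear|writ'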
 
One year in raidz1; the dataset was created with new QVO SSDs.
I have 2 machines with the same setup (2x3 QVO SSDs); they are all at 4% wearout with a calculated TBW of ~25TB.
All have approximately the same SMART values, as follows.
Code:
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 QVO 1TB
Serial Number:    XXXXXXXXXXXXXXXX
LU WWN Device Id: 5 002538 f4236896d
Firmware Version: SVQ02B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 12 09:02:26 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       6583
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       9
177 Wear_Leveling_Count     0x0013   096   096   000    Pre-fail  Always       -       39
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   061   055   000    Old_age   Always       -       39
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       7
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       49436365299
 
I think I managed to resolve the problem by disabling Intel vulnerability mitigations. Newer kernels (after 5.15) seem to change something related to those vulnerabilities.
 
How do you do that? BIOS?

Edit: thanks, it turned out to be obvious: modify the GRUB config. The additional argument is 'mitigations=off'.

As soon as I can restart, I'll check the differences.
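In case it helps anyone else, on a GRUB-booted node the change looks roughly like this (a sketch only; the exact existing options in your file may differ):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"

# apply and reboot
update-grub
reboot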
 
After adding that, you can check it using lscpu and make sure the word "mitigation" no longer appears. Also check whether Proxmox is using systemd-boot or GRUB.
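A rough way to check both points; proxmox-boot-tool status shows whether the node boots via GRUB or systemd-boot (for systemd-boot the argument goes into /etc/kernel/cmdline, followed by a proxmox-boot-tool refresh instead of update-grub):

Code:
proxmox-boot-tool status          # grub or systemd-boot?
cat /proc/cmdline                 # after the reboot: mitigations=off should be present
lscpu | grep -i mitigation        # ideally prints nothing once mitigations are off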
 
Wow, that increased the random 4K write IOPS by 200% :eek:

Thanks a lot.

before:
Code:
                        KDiskMark (3.1.4): https://github.com/JonMagon/KDiskMark
                    Flexible I/O Tester (fio-3.33): https://github.com/axboe/fio
--------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
Sequential   1 MiB (Q=  8, T= 1):  2076.388 MB/s [   2027.7 IOPS] <  3885.55 us>
Sequential   1 MiB (Q=  1, T= 1):   400.219 MB/s [    390.8 IOPS] <  2550.45 us>
    Random   4 KiB (Q= 32, T= 1):    60.223 MB/s [  15055.8 IOPS] <  2118.58 us>
    Random   4 KiB (Q=  1, T= 1):    17.162 MB/s [   4290.7 IOPS] <   226.52 us>

[Write]
Sequential   1 MiB (Q=  8, T= 1):   311.519 MB/s [    304.2 IOPS] <  6769.39 us>
Sequential   1 MiB (Q=  1, T= 1):   175.788 MB/s [    171.7 IOPS] <  2778.13 us>
    Random   4 KiB (Q= 32, T= 1):    47.912 MB/s [  11978.0 IOPS] <  2374.76 us>
    Random   4 KiB (Q=  1, T= 1):    14.222 MB/s [   3555.6 IOPS] <   266.69 us>

Profile: Default
   Test: 1 GiB (x1) [Measure: 5 sec / Interval: 5 sec]
   Date: 2023-07-11 18:51:12

after:
Code:
                        KDiskMark (3.1.4): https://github.com/JonMagon/KDiskMark
                    Flexible I/O Tester (fio-3.33): https://github.com/axboe/fio
--------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
Sequential   1 MiB (Q=  8, T= 1):  3382.503 MB/s [   3303.2 IOPS] <  2388.42 us>
Sequential   1 MiB (Q=  1, T= 1):   404.699 MB/s [    395.2 IOPS] <  2521.21 us>
    Random   4 KiB (Q= 32, T= 1):   107.338 MB/s [  26834.7 IOPS] <  1187.51 us>
    Random   4 KiB (Q=  1, T= 1):    18.919 MB/s [   4729.9 IOPS] <   205.41 us>

[Write]
Sequential   1 MiB (Q=  8, T= 1):   315.266 MB/s [    307.9 IOPS] <  6525.32 us>
Sequential   1 MiB (Q=  1, T= 1):   222.722 MB/s [    217.5 IOPS] <  2498.83 us>
    Random   4 KiB (Q= 32, T= 1):    81.184 MB/s [  20296.1 IOPS] <  1563.56 us>
    Random   4 KiB (Q=  1, T= 1):    15.386 MB/s [   3846.6 IOPS] <   247.74 us>

Profile: Default
   Test: 1 GiB (x1) [Measure: 5 sec / Interval: 5 sec]
   Date: 2023-07-12 07:32:57
     OS: debian 12 [linux 6.1.0-10-amd64]
 
Hi,
I don't know if I can post here in this thread; if I need to create a new topic, let me know.

I am facing the same problems, and yes, mitigations=off helps.

But I'm not happy having this set on my system.
Aren't there security concerns with running the system this way?
 