Hi all,
I have a strange issue with a Windows Server 2022 VM: every few seconds the whole system freezes for about a second. The higher the load, the more often it happens, but each freeze is short: usually a second or less, sometimes up to 2 seconds, never more. It is an RDP server, and working with these "lags" is really bad. The issue shows over RDP and also in the web console (noVNC), and it can be made visible inside the VM without involving the network.
I have been chasing this for several days, ran a lot of tests and read for many hours, but have not been able to solve it yet. The VM runs on a two-socket server (2 x 16C/32T older Xeon) with 2 x PM1653 SAS (ZFS mirror) plus some spinning disks (for backup, not used by the VM) and 256 GB RAM, all ZFS only. On the PVE node summary page, RAM usage is below 50% (~115 GB) and the daily CPU maximum is 40%. The node has no swap space.
The VM's RAM usage is ~30%, but its CPU usage is high during office hours, peaking at 60-80%. Inside the VM, I see a few processes at 2-5% each and a total of up to 80% (although I think adding up all processes gives about 30%, not 80%). At the top of the list I often saw "System interrupts" ("Systemunterbrechungen" in the German Task Manager) with ~2% CPU load.
The VM itself does not seem to notice the lags, as if it hung entirely for a moment: a "world freeze". When I ping from outside, I see high RTTs (500-2000 ms) that correlate perfectly with the input lags / hangs. When I ping from inside, I can see the ping hang, but it reports <=3 ms, as if the "clock" for the ping hung as well. However, when I use HD Tune Pro, a disk benchmark tool, I see not only the tool hang for a second or two, but also a "down-spike" in the read speed afterwards, again with 100% correlation to the hangs. So Windows ping does not see the issue, but HD Tune Pro does. From the latter I conclude that this is not a network-related issue.
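For anyone who wants to log such freezes without a benchmark tool, here is a minimal sketch of a stall detector. It is not something used in the tests above; the function names `sample` and `find_stalls` and the 50 ms / 200 ms values are assumptions chosen for illustration. The idea is to sleep in short steps and report every step that took much longer than requested. Whether the guest's monotonic clock keeps advancing across a freeze is exactly what running it inside the VM would show.

```python
import time


def find_stalls(timestamps, interval, threshold):
    """Return (start_time, overshoot) for every gap between consecutive
    timestamps that exceeds the expected interval by more than threshold."""
    stamps = list(timestamps)
    stalls = []
    for prev, cur in zip(stamps, stamps[1:]):
        overshoot = (cur - prev) - interval
        if overshoot > threshold:
            stalls.append((prev, overshoot))
    return stalls


def sample(duration, interval=0.05):
    """Collect monotonic timestamps by sleeping `interval` seconds in a loop
    for roughly `duration` seconds."""
    stamps = [time.monotonic()]
    end = stamps[0] + duration
    while stamps[-1] < end:
        time.sleep(interval)
        stamps.append(time.monotonic())
    return stamps


# Example use (takes about one minute to run):
#   stamps = sample(60.0, interval=0.05)
#   for start, extra in find_stalls(stamps, interval=0.05, threshold=0.2):
#       print(f"stall at +{start - stamps[0]:.2f}s, {extra * 1000:.0f} ms late")
```

Run inside the guest and on the host at the same time, the two logs could be compared against the external ping RTT spikes.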
The problem is much worse during office hours than at night, but I have failed to provoke it artificially (so that I could test at night in a maintenance window): I see no relation to running prime95 on all vCores or to running a disk read benchmark; neither makes the problem any worse.
What I already tried:
- 1 socket 10 cores
- 1 socket 24 cores
- 2 socket 12 cores + NUMA
- 48 GB RAM
- add swap inside the VM (the host has no swap space, but uses only ~115 of 256 GB RAM)
- Microsoft\TIP\TestResults check (nothing big there)
- netsh int tcp set global rsc=disabled
- Get-NetAdapterRsc | Disable-NetAdapterRsc
- disable energy saving (max power profile, disabled whatever I could find) and set display sleep to never
- check MSI IRQ (all are MSI except Balloon, SM and USB, which each have their own IRQ number)
- there is no Hyper-V and no WSL (Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux)
What does correlate well with the lags is CPU wait. I ran vmstat 1 on the PVE host and see:
Code:
procs ------------memory------------ ---swap-- -----io----- ---system--- ------cpu-----
 r  b swpd    free      buff   cache   si   so    bi     bo    in     cs us sy id wa st
 9  0    0 5545684 138644288 2925236    0    0    16   5044 20044 148127  0  1 82  0  0
 5  0    0 5545684 138644288 2925236    0    0     0   8544 23378 151690  0  6 78  0  0
16  0    0 5545684 138644288 2925236    0    0     4    536 21245 153818  0  1 80  0  0
28  0    0 5545684 138644288 2925236    0    0   136   7684 23492 158319  0  3 77  0  0
# (lag here)
23  0    0 5545684 138644288 2925236    0    0     0   9948 29364  95193  0 28 57  0  0
20  0    0 5545684 138644288 2925236    0    0     0  51228 20392 182788  0  2 86  0  0
10  0    0 5545684 138644288 2925236    0    0     4  22852 18058 120430  0  1 90  0  0
11  0    0 5545684 138644288 2925236    0    0     0    376 16074  98264  0  1 91  0  0
 4  0    0 5545684 138644288 2925236    0    0     0    104 15975  98421  0  1 91  0  0
26  0    0 5545684 138644288 2925236    0    0     0   7392 27543  58830  0 23 69  0  0
# (lag here)
15  0    0 5545684 138644288 2925236    0    0    40  37276 24181  83393  0 13 80  0  0
13  0    0 5545684 138644288 2925236    0    0   248    572 15575  94255  0  1 92  0  0
 7  0    0 5545684 138644288 2925236    0    0     8    808 17581  92197  0  1 89  0  0
16  0    0 5545684 138644288 2925236    0    0    96  10072 16511  97244  0  1 88  0  0
17  0    0 5545684 138644288 2925236    0    0    28  16204 16982 119799  0  1 85  0  0
 8  0    0 5545684 138644288 2925236    0    0     8  53624 23660 134408  0  8 76  0  0
21  1    0 5545684 138644288 2925236    0    0     8  21352 17478 147227  0  2 79  0  0
17  0    0 5545684 138644288 2925236    0    0     4  32704 20080 153549  0  2 74  0  0
29  0    0 5545684 138644288 2925236    0    0     0  44948 46321 111858  0 22 59  0  0
# (lag here)
13  0    0 5545684 138644288 2925236    0    0     0  16372 29618 129393  0  9 78  0  0
14  0    0 5545684 138644288 2925236    0    0     0 150120 23411 176360  0  5 81  0  0
18  0    0 5545684 138644288 2925236    0    0    40   9496 39449  84687  0 29 61  0  0
# (lag here)
 5  0    0 5545684 138644288 2925236    0    0    24    900 14759  96520  0  1 92  0  0
10  0    0 5545684 138644288 2925236    0    0    40    740 14762  92890  0  1 87  0  0
 8  0    0 5545684 138644288 2925236    0    0     0     44 76008  67731  0  8 75  0  0
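To mark the suspicious samples automatically instead of by eye, a small parser along the following lines might help. This is only a sketch: `parse_vmstat` and `flag_spikes` are hypothetical helper names, and the threshold of 20 is an arbitrary assumption. It defaults to the `sy` column because in the output above `wa` is 0 in every sample, while `sy` jumps to 22-29% in the rows next to the lag markers.

```python
def parse_vmstat(text):
    """Parse `vmstat` text output into a list of dicts keyed by column name.
    Header lines and '#' comment lines are skipped."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(("procs", "#")):
            continue
        if fields[0] == "r":  # column header (vmstat repeats it periodically)
            columns = fields
            continue
        if columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def flag_spikes(rows, column="sy", threshold=20):
    """Return (row_index, value) for rows whose column value is >= threshold."""
    return [(i, row[column]) for i, row in enumerate(rows) if row[column] >= threshold]


# Three samples copied from the output above.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
28 0 0 5545684 138644288 2925236 0 0 136 7684 23492 158319 0 3 77 0 0
# (lag here)
23 0 0 5545684 138644288 2925236 0 0 0 9948 29364 95193 0 28 57 0 0
20 0 0 5545684 138644288 2925236 0 0 0 51228 20392 182788 0 2 86 0 0
"""

rows = parse_vmstat(SAMPLE)
print(flag_spikes(rows))
```

Fed with a longer capture (for example `vmstat 1 > vmstat.log` on the host), the flagged indices could be matched against the times users report a freeze.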
Inside the VM, neither CPU nor disk I/O usually looks bad, and the Windows Performance Indicator report marks everything green. In Task Manager there are only a few tasks at 2-5%, but at the top I often saw "Systemunterbrechungen" (System interrupts).
Unfortunately the problem seems to be most visible when several people work on the server (and I have failed to provoke it with a stress test), and I cannot reboot it outside a maintenance window (night shifts work on it too).
I'm out of ideas and hope someone can point me in a direction for what to try next, please!
Code:
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-8-pve)
Code:
root@pve-2:~# cat /etc/pve/qemu-server/107.conf
agent: 1
bios: ovmf
boot: order=virtio0;ide2;net0
cores: 12
cpu: host
efidisk0: local-zfs:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
machine: pc-q35-8.0
memory: 49152
meta: creation-qemu=8.0.2,ctime=1695813306
name: w2k22-ts
net0: virtio=0E:4B:CB:cc:bb:cc,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=28f74c6e-bde3-49d5-b215-68a4031512803
sockets: 2
virtio0: local-zfs:vm-102-disk-1,cache=writethrough,iothread=1,size=432G
vmgenid: d16b6ad8-226f-4baf-a4d8-564331511392f
[PENDING]
balloon: 0
vga: virtio
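For comparing the guest topology with the host, the config can be reduced to a few numbers. The sketch below uses hypothetical helper names (`parse_pve_conf`, `topology`) and assumes that, with `numa: 1` and no explicit `numaN` entries, memory is split evenly across the virtual sockets. With the values above it yields 24 vCPUs in two virtual NUMA nodes of 12 vCPUs and 24576 MiB each, against 16 cores / 32 threads per physical socket on the host.

```python
def parse_pve_conf(text):
    """Parse the 'key: value' lines of the main section of a VM config.
    Parsing stops at the first section header such as [PENDING]."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def topology(conf):
    """Derive vCPU count and per-node memory (MiB) from a parsed config.
    Assumes an even memory split across virtual sockets when numa is on."""
    sockets = int(conf.get("sockets", 1))
    cores = int(conf.get("cores", 1))
    memory = int(conf["memory"])
    nodes = sockets if conf.get("numa", "0") == "1" else 1
    return {
        "vcpus": sockets * cores,
        "numa_nodes": nodes,
        "mem_per_node_mib": memory // nodes,
    }


# Relevant lines copied from the config above.
SAMPLE_CONF = """\
cores: 12
memory: 49152
numa: 1
sockets: 2
[PENDING]
balloon: 0
"""

print(topology(parse_pve_conf(SAMPLE_CONF)))
```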