RDS/Terminal Server random freeze, console not responding

check-ict · Aug 20, 2016

Hello,

We have had an issue with our VMs for over 2 years now. We host multiple Terminal Servers within our Proxmox cluster. They are based on Server 2012 R2 with multiple user sessions and normal programs like Office.

Sometimes these VMs randomly freeze. This happens only once every 2 or 3 months per VM, but we are running about 15 Terminal Servers now, so rebooting one of them has become a weekly routine.

We changed all hardware 3 months ago, moving from hybrid ZFS storage to SSD ZFS storage. The IO is never an issue: the SSDs are really fast and the disk IO is minimal. We also replaced all Proxmox nodes with new hardware and installed Proxmox 4.2 (the old setup was 3.4). We connect to the ZFS storage with NFS over a bond of 4x1Gbit (Intel quad port).

We also run about 120 other VMs that don't have these issues; it only happens on Terminal Servers where users log in.

This is what happens:
The VM becomes unreachable, however you can still ping the VM and Nagios checks work (like disk, CPU, uptime etc.). No one is able to log in to the server. When we open the console, the Windows welcome screen shows and you have to send CTRL+ALT+DELETE. When we send this command, nothing happens. A reset or shutdown does not work either. The only way to fix it is to stop the VM and then start it.

When we look back in the logs, nothing was wrong. There was no load, no special user action, no errors...

We also tried E1000 and IDE instead of VirtIO, but this doesn't solve the problem. We do notice that VirtIO disks on a Terminal Server seem to hang more often (once a month per VM, where IDE is about once in 3 months).

These 15 Terminal Servers all have different software and different installation times; some have the latest updates and some do not. They are for different customers and have different workloads. The only thing they have in common is that they are Terminal Servers where users log in.

We already disabled swap, DEP and AV. Disabling swap seems to have a little impact: it crashes less.

Everything points to storage, however we have really fast storage. We use 2 live SSD servers (24x1TB SSD per storage unit) and VMs have problems on both of them. Our previous storage was 12x2TB disks + 2x SSD for cache. It had the same problems, and my thought was that these units might be overloaded at peak moments by other VMs.

We also noticed that this problem does not occur when we use local disks. It only happens on NFS shared storage so far. We have a Proxmox node that has 8x1TB disks + 2x500 GB SSD with ZFS and that works fine: Terminal Servers don't crash on this stand-alone server.

Current set-up:
5x Proxmox 4.2 nodes with 2 ports in LACP bond
2x SSD ZFS NFS nodes with 4 ports in ALB bond
Managed 48p gigabit switch with LACP memberships for the Proxmox nodes

Any suggestions on what we can do to prevent the Terminal Servers from freezing? Any idea why the Proxmox console does not work at the moment the VM freezes? And what can prevent a "reset" from working?

The biggest question is: is this a Proxmox/KVM issue, an NFS issue or a Windows issue? We need to know, because last time we thought it was the hardware/Proxmox version, and after a large upgrade of all hardware and software the result is the same.

sumsum · Aug 20, 2016

Have you ever run any NFS performance tests? What does your NFS configuration look like?

We had similar problems on Proxmox versions prior to version 3. Some of our engineers nailed the issues down by tweaking the TCP stack and NFS configuration.

Your issues may be NFS related and not hardware.

check-ict · Aug 20, 2016

We have no special NFS settings.

We installed Debian 8 with nfs-kernel-server, installed ZFS on Linux and did "zfs set sharenfs=on data/proxmox".

Please let me know if I need to tweak something. We can try this on a third test server and check the results.
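
Before tweaking anything, it can help to see which NFS options the Proxmox nodes actually negotiated. The following is a minimal Python sketch, assuming a Linux host where /proc/mounts is readable; the function name nfs_mounts is made up for this example and is not part of Proxmox.

```python
def nfs_mounts(mounts_text):
    """Return {mount_point: {option: value_or_True}} for NFS client mounts.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Only NFS client mounts are of interest (NFSv3 shows up as "nfs",
        # NFSv4 as "nfs4").
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags such as "hard" have no value; store True for them.
            opts[key] = value if sep else True
        result[fields[1]] = opts
    return result
```

On a node this could be called as nfs_mounts(open("/proc/mounts").read()) and the vers, rsize, wsize, proto and hard/soft entries compared between nodes or against a test server.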

LnxBil · Aug 21, 2016

Naturally, if you only have a problem in Windows, it's a Windows problem. What about the logs there? Since monitoring reports nothing wrong, as you said, it has to be something with the Terminal Server itself. Have you tried to debug the network stack to see if something is wrong there? QEMU, and therefore Proxmox VE, opens at least one tap interface for every VM (one per NIC), on which you can tcpdump the traffic and analyze it with Wireshark. Maybe this helps.
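
As a rough illustration of the tcpdump suggestion: on a Proxmox host the tap devices are normally named tap<VMID>i<NIC index>, for example tap101i0. The Python sketch below only builds the command line; the VM ID 101, the output directory and the helper names are assumptions made for this example.

```python
def tap_interface(vmid, nic_index=0):
    """Host-side tap device name for a VM NIC (tap<vmid>i<index>)."""
    return "tap{}i{}".format(vmid, nic_index)


def tcpdump_command(vmid, nic_index=0, out_dir="/tmp"):
    """Build a tcpdump command that captures one VM NIC into a pcap file."""
    iface = tap_interface(vmid, nic_index)
    pcap = "{}/{}.pcap".format(out_dir, iface)
    # -n: do not resolve names, -i: capture interface,
    # -w: write raw packets to a file that Wireshark can open
    return ["tcpdump", "-n", "-i", iface, "-w", pcap]
```

The list can be passed to subprocess.run() as root on the node that hosts the frozen VM, and the resulting file opened in Wireshark afterwards.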

check-ict · Aug 22, 2016

I will check with tcpdump on a hanging VM next time. Still weird that Proxmox console also locks up, only stop/start works.

christophe · Sep 7, 2016

Hi,

Anything new on this problem?
A few years ago we had similar random hangs with nothing in the event log.
Then we moved to a network syslog and saw event ID 129 (viostor). In fact, the VM does NOT hang, but becomes unresponsive and seems frozen. It is just waiting for storage...
See this thread: https://forum.proxmox.com/threads/some-windows-guests-hanging-every-night.20046.
All affected VMs were RDS servers.
Storage was on a SAN, not a NAS, but your description seems really close to what we observed...
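
If the Windows event log is forwarded to a syslog server as described above, the viostor resets can be picked out with a small filter. This is only a sketch in Python: the thread does not show the log format, so the code assumes that each line contains the source name "viostor" and the event ID 129 as a separate number. Both the format and the function name are assumptions.

```python
import re

# Assumption: the event ID appears as a standalone number in the line.
# Real formats depend on the forwarding agent and may need another pattern.
EVENT_129 = re.compile(r"\b129\b")


def viostor_resets(lines):
    """Return the lines that look like viostor event ID 129 entries."""
    hits = []
    for line in lines:
        if "viostor" in line.lower() and EVENT_129.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

Comparing the timestamps of the matching lines with the times users reported a freeze would show whether the guest was waiting for storage at that moment.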

One very interesting point is that you didn't see the problem happen on local storage: I'll try that.

Christophe.

check-ict · Sep 7, 2016

Nothing new yet, we just stop/start the VMs that freeze.

We have multiple customers that have a local server. We install them stand-alone with Proxmox/ZFS and use local disk storage. These customers have no problems running a Terminal Server. We also have 1 server in our datacenter that runs 2 Terminal Servers without freezes.

We created an extra Nagios check that tells us if a Terminal Server is frozen so we can reboot it ASAP. This reduces the impact a lot, so a customer will only notice a hang once a year during business hours.

Not a perfect solution but it works for now. Starting to look at other virtualisation software meanwhile.
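
The thread never shows how this Nagios check was written. One possible approach, sketched here in Python under the assumption that a frozen guest still answers ping but no longer completes an RDP handshake, is to send an X.224 Connection Request to port 3389 and wait for the Connection Confirm. The function name, timeout and messages are made up for this example.

```python
import socket

# X.224 Connection Request carrying an RDP negotiation request (19 bytes).
X224_CONNECTION_REQUEST = bytes.fromhex(
    "03000013"          # TPKT header: version 3, total length 19
    "0ee00000000000"    # X.224 CR: length 14, code 0xE0, refs 0, class 0
    "0100080003000000"  # RDP_NEG_REQ: ask for TLS or CredSSP
)

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_rdp(host, port=3389, timeout=10.0):
    """Return (exit_code, message) in Nagios plugin style."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(X224_CONNECTION_REQUEST)
            reply = sock.recv(64)
    except socket.timeout:
        return CRITICAL, "CRITICAL - no RDP reply within {}s".format(timeout)
    except OSError as exc:
        return CRITICAL, "CRITICAL - RDP connection failed: {}".format(exc)
    # A Connection Confirm starts with a TPKT header (0x03) and has
    # X.224 code 0xD0 at byte offset 5.
    if len(reply) >= 6 and reply[0] == 0x03 and reply[5] == 0xD0:
        return OK, "OK - RDP service answered"
    return CRITICAL, "CRITICAL - unexpected RDP reply"
```

A plugin wrapper would print the message and call sys.exit() with the returned code. Whether a frozen Terminal Server really fails this probe has to be verified on an affected VM; if the handshake still succeeds, a deeper check (for example a scripted logon) would be needed.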

mstsas · Feb 11, 2019

Hi,

I have a similar issue on a new installation of Proxmox 5.3.

# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-35
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-1
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-44
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

I have 3 identical VMs with Windows 2012 R2 and Terminal Server. Randomly, one of these servers becomes unresponsive, and the only solution is to stop and start it.

I've checked the host log and the VM log, and there are no errors or warnings. Any suggestion will be appreciated.

mstsas · Feb 11, 2019

check-ict said:
We created an extra Nagios check that tells us if a Terminal Server is frozen.

Not a perfect solution but it works for now. Starting to look at other virtualisation software meanwhile.

How do you check if a Terminal Server is frozen?

check-ict · Feb 11, 2019

We solved our problem. We used the stable VirtIO drivers (not the latest) and we disabled RSC for IPv4 and IPv6 on the VirtIO adapters.

mstsas · Feb 11, 2019

OK check-ict,

we use the stable VirtIO drivers too, but I don't know how to disable RSC. Can you give me instructions, please?

check-ict · Feb 11, 2019

On the hosted Windows Server OS, navigate to Network Connections, open the Ethernet adapter properties and disable RSC for both IPv4 and IPv6. Once you have manually disabled RSC on the network adapter, you must restart the machine for the changes to take effect.
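
As an alternative to clicking through the adapter properties, Windows Server 2012 and later ship a PowerShell cmdlet, Disable-NetAdapterRsc, that switches RSC off per adapter. The Python sketch below only builds that PowerShell call so it can be scripted across several guests. The adapter name "Ethernet" and the helper name are assumptions, and this is not the method the poster described.

```python
def disable_rsc_command(adapter_name):
    """Build a PowerShell call that disables RSC for IPv4 and IPv6."""
    # PowerShell single-quoted strings escape a quote by doubling it.
    quoted = "'" + adapter_name.replace("'", "''") + "'"
    script = "Disable-NetAdapterRsc -Name {} -IPv4 -IPv6".format(quoted)
    return ["powershell.exe", "-NoProfile", "-Command", script]
```

Inside the guest, the list can be passed to subprocess.run() from an elevated session; Get-NetAdapterRsc shows the resulting state for each adapter.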

mstsas · Feb 15, 2019

I applied this change on Monday, but yesterday the issue came back!

The hosted Windows Server responds very slowly on both the VNC console and RDP logon (10-20 sec. timeout). Any suggestions on where to investigate are welcome, please!

mac.linux.free · Feb 20, 2020

https://www.atlantic.net/cloud-hosting/how-to-disabling-tcp-offloading/

jjdoran · Aug 16, 2024

Check that the memory is sufficient. The logs may show services struggling.
And windoze dies without any nice messages in the log about it.

HTH
