RDS/Terminal Server random freeze, console not responding

check-ict

Hello,

We have had an issue with our VMs for over 2 years now. We host multiple Terminal Servers in our Proxmox cluster. They run Server 2012 R2 with multiple user sessions and ordinary programs such as Office.

Sometimes these VMs randomly freeze. This happens only once every 2 or 3 months per VM, but we are running about 15 Terminal Servers now, so rebooting one of them has become a weekly routine.

We replaced all hardware 3 months ago, moving from hybrid ZFS storage to SSD ZFS storage. IO is never an issue: the SSDs are really fast and the disk IO is minimal. We also replaced all Proxmox nodes with new hardware and installed Proxmox 4.2 (the old setup was 3.4). We connect to the ZFS storage over a bond of 4x1 Gbit (Intel quad port).

We also run about 120 other VMs that don't have these issues. It only happens on Terminal Servers where users log in.

This is what happens:
The VM becomes unreachable, although you can still ping it and Nagios checks (disk, CPU, uptime, etc.) still work. No one is able to log in to the server. When we open the console, it shows the Windows welcome screen, which asks for CTRL+ALT+DELETE. When we send this key combination, nothing happens. A reset or shutdown does not work either. The only fix is to stop the VM and then start it again.

When we look back in the logs, nothing appears to be wrong. There was no load, no special user action, no errors...

We also tried E1000 and IDE instead of VirtIO, but this doesn't solve the problem. We do notice that Terminal Servers on VirtIO disks seem to hang more often (about once a month per VM, versus about once every 3 months with IDE).

These 15 Terminal Servers all have different software and different installation dates; some have the latest updates and some do not. They belong to different customers and have different workloads. The only thing they have in common is that they are Terminal Servers that users log in to.

We already disabled swap, DEP and AV. Disabling swap seems to have a small effect: the VMs freeze less often.

Everything we see points to storage, yet our storage is really fast. We use 2 live SSD servers (24x1TB SSD per storage unit), and VMs have problems on both of them. Our previous storage was 12x2TB disks + 2x SSD for cache. It had the same problems, and at the time I thought those units might be overloaded at peak moments by other VMs.

We also noticed that this problem does not occur when we use local disks. So far it only happens on NFS shared storage. We have a Proxmox node with 8x1TB disks + 2x500 GB SSD on ZFS that works fine; Terminal Servers don't crash on this stand-alone server.

Current set-up:
5x Proxmox 4.2 nodes with 2 ports in LACP bond
2x SSD ZFS NFS nodes with 4 ports in ALB bond
Managed 48p gigabit switch with LACP memberships for the Proxmox nodes

Any suggestions on what we can do to prevent the Terminal Servers from freezing? Any idea why the Proxmox console does not work at the moment the VM freezes? And what can prevent a "reset" from working?

The biggest question is: is this a Proxmox/KVM issue, an NFS issue, or a Windows issue? We need to know, because last time we thought it was the hardware or the Proxmox version, and after a large upgrade of all hardware and software the result is the same.
 
Have you ever run any NFS performance tests? What does your NFS configuration look like?

We had similar problems on Proxmox versions prior to version 3. Some of our engineers nailed the issues down by tweaking the TCP stack and the NFS configuration.

Your issues may be NFS related and not hardware related.
 
We have no special NFS settings.

We installed Debian 8 with nfs-kernel-server, installed ZFS on Linux and ran "zfs set sharenfs=on data/proxmox".

Please let me know if I need to tweak something. We can try this on a third test server and check the results.
 
Naturally, if you only have a problem in Windows, it's a Windows problem. What about the logs there? Since monitoring reports nothing wrong, as you said, it has to be something with the Terminal Server itself. Have you tried to debug the network stack to see if something is wrong there? QEMU, and therefore Proxmox VE, opens at least one tap interface for every VM (one per NIC), on which you can capture the traffic with tcpdump and analyze it with Wireshark. Maybe this helps.
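
As a rough illustration of the capture suggested above, here is a minimal Python sketch that builds the tcpdump command line for a VM's tap interface. It assumes the usual Proxmox VE naming scheme tap<VMID>i<N>; the VMID 100 and the /tmp output path are placeholders, not values from this thread.

```python
import shlex


def tap_interface(vmid, nic_index=0):
    """Name of the host-side tap device for one VM NIC (assumed tap<VMID>i<N>)."""
    return "tap%di%d" % (vmid, nic_index)


def tcpdump_command(vmid, nic_index=0, pcap_path=None):
    """Build a tcpdump argument list that writes packets to a pcap file."""
    iface = tap_interface(vmid, nic_index)
    if pcap_path is None:
        pcap_path = "/tmp/%s.pcap" % iface  # placeholder location
    # -n: no name resolution, -s 0: capture full packets, -w: write a pcap file
    return ["tcpdump", "-i", iface, "-n", "-s", "0", "-w", pcap_path]


# Print the command for a hypothetical VM with ID 100, first NIC.
print(shlex.join(tcpdump_command(100)))
```

Run the printed command as root on the Proxmox node while the VM is hanging, then open the resulting pcap file in Wireshark.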
 
I will check with tcpdump on a hanging VM next time. It is still weird that the Proxmox console also locks up and only stop/start works.
 
Hi,

Anything new on this problem?
A few years ago we had similar random hangs with nothing in the event log.
Then we moved to a network syslog and saw event ID 129 (viostor). In fact, the VM does NOT hang; it becomes unresponsive and appears frozen. It is just waiting for storage...
See this thread: https://forum.proxmox.com/threads/some-windows-guests-hanging-every-night.20046.
All affected VMs were RDS servers.
Our storage was on a SAN, not a NAS, but your description seems really close to what we observed...

One very interesting point is that you didn't see the problem on local storage: I'll try that.

Christophe.
 
Nothing new yet. We just stop/start the VMs that freeze.

We have multiple customers with a local server. We install these stand-alone with Proxmox/ZFS and use local disk storage. These customers have no problems running a Terminal Server. We also have 1 server in our datacenter that runs 2 Terminal Servers without freezes.

We created an extra Nagios check that tells us when a Terminal Server is frozen so we can reboot it ASAP. This reduces the impact a lot: a customer will notice a hang during business hours only about once a year.
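
The actual check is not shown in this thread. One possible approach, sketched here in Python, is to attempt the first step of an RDP handshake and report CRITICAL if the server does not answer. The function name probe_rdp, the default timeout and the idea that a frozen Terminal Server fails this probe are all assumptions, not something confirmed above.

```python
import socket

# Nagios plugin exit codes
OK, CRITICAL = 0, 2

# X.224 Connection Request carrying an RDP negotiation request (19 bytes).
X224_CONNECTION_REQUEST = bytes.fromhex(
    "03000013"          # TPKT header: version 3, total length 19
    "0ee00000000000"    # X.224 Connection Request TPDU
    "0100080003000000"  # RDP_NEG_REQ: offer TLS and CredSSP
)


def probe_rdp(host, port=3389, timeout=10.0):
    """Return (nagios_code, message) for one RDP handshake attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(X224_CONNECTION_REQUEST)
            reply = sock.recv(64)
    except socket.timeout:
        # Port accepted the connection but nothing answered in time.
        return CRITICAL, "CRITICAL - no RDP reply within %.1fs" % timeout
    except OSError as exc:
        return CRITICAL, "CRITICAL - connection failed: %s" % exc
    # A healthy listener answers with a TPKT packet, which starts with 0x03.
    if reply[:1] == b"\x03":
        return OK, "OK - RDP listener answered"
    return CRITICAL, "CRITICAL - unexpected or empty reply"
```

To use this as a Nagios plugin, a small wrapper would call probe_rdp with the host name, print the message and exit with the returned code.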

Not a perfect solution, but it works for now. Meanwhile we are starting to look at other virtualisation software.
 
Hi,

I have a similar issue on a new installation of Proxmox 5.3.

# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-35
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-1
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-44
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

I have 3 identical VMs running Windows 2012 R2 with Terminal Server. Randomly, one of these servers becomes unresponsive, and the only solution is to stop and start it.

I've checked the host log and the VM log, and there are no errors or warnings. Any suggestion will be appreciated.
 
We solved our problem. We used the stable virtio drivers (not the latest) and we disabled RSC for IPv4 and IPv6 on the virtio adapters.
 
On the hosted Windows Server OS, navigate to Network Connections, open the Ethernet adapter properties and disable RSC for both IPv4 and IPv6. Once you have manually disabled RSC on the network adapter, you must restart the machine for the changes to take effect.
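
If you prefer to script this step rather than click through the adapter properties, a minimal Python sketch could call the PowerShell cmdlet Disable-NetAdapterRsc inside the guest. The adapter name "Ethernet" is an assumption (check the real name with Get-NetAdapter), and the helper names are made up for illustration.

```python
import subprocess


def disable_rsc_command(adapter_name):
    """Build the PowerShell call that turns off RSC for IPv4 and IPv6."""
    # Double any single quotes so the name is safe inside a PowerShell literal.
    quoted = "'" + adapter_name.replace("'", "''") + "'"
    script = "Disable-NetAdapterRsc -Name %s -IPv4 -IPv6" % quoted
    return ["powershell.exe", "-NoProfile", "-Command", script]


def disable_rsc(adapter_name):
    """Run the command; needs an elevated session inside the Windows guest."""
    subprocess.run(disable_rsc_command(adapter_name), check=True)


# Show the command for an adapter assumed to be called "Ethernet".
print(disable_rsc_command("Ethernet")[-1])
```

Calling disable_rsc("Ethernet") from an administrator session applies the change; the restart mentioned above is still needed afterwards.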
 