Hello,
We have had an issue with our VMs for over 2 years now. We host multiple Terminal Servers within our Proxmox cluster. They are based on Server 2012 R2 with multiple user sessions and normal programs like Office.
Sometimes these VMs randomly freeze. This happens only once every 2 or 3 months per VM, but we are running about 15 Terminal Servers now, so rebooting one of them has become a weekly routine.
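To show why a rare per-VM freeze still means a reboot every week, here is a rough back-of-the-envelope sketch. It assumes the freezes are spread evenly over time and independent between VMs, which we have not verified; the function name is just for illustration.

```python
def expected_freezes_per_week(servers: int, months_between_freezes: float) -> float:
    """Average number of freezes per week across all servers.

    Assumes each server freezes once every `months_between_freezes`
    months on average, evenly spread (an assumption, not a measurement).
    """
    weeks_per_month = 52 / 12
    return servers / (months_between_freezes * weeks_per_month)


# 15 Terminal Servers, each freezing once every 3 (best case) or 2 (worst case) months
low = expected_freezes_per_week(15, 3)
high = expected_freezes_per_week(15, 2)
print(f"{low:.1f} to {high:.1f} freezes per week")
```

So across the whole cluster we should expect somewhat more than one freeze per week, which matches what we see.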
We replaced all hardware 3 months ago, moving from hybrid ZFS storage to SSD ZFS storage. IO is never an issue: the SSDs are really fast and the disk IO is minimal. We also replaced all Proxmox nodes with new hardware and installed Proxmox 4.2 (the old setup was 3.4). We connect to the ZFS storage with NFS over a bond of 4x1Gbit (Intel quad port).
We also run about 120 other VMs that don't have these issues; it only happens on Terminal Servers where users log in.
This is what happens:
The VM becomes unreachable for users, although you can still ping it and the Nagios checks still work (disk, CPU, uptime etc.). No one is able to log in to the server. When we open the console, the Windows welcome screen shows and asks for CTRL+ALT+DELETE. When we send this command, nothing happens. A reset or shutdown does not work either. The only way to fix it is to stop the VM and then start it again.
When we look back in the logs, nothing was wrong. There was no load, no special user action, no errors...
We also tried E1000 and IDE instead of VirtIO, but this doesn't solve the problem. We do notice that Terminal Servers with VirtIO disks seem to hang more often (about once a month per VM, where IDE is about once every 3 months).
These 15 Terminal Servers all have different software and different installation dates; some have the latest updates and some do not. They are for different customers and have different workloads. The only thing they have in common is that they are Terminal Servers that users log in to.
We already disabled swap, DEP and AV. Disabling swap seems to have a small effect: the VMs crash less often.
All indications point to storage, yet our storage is really fast. We use 2 live SSD servers (24x1TB SSD per storage unit) and VMs have problems on both of them. Our previous storage was 12x2TB disks + 2x SSD for cache. It had the same problems, and at the time I thought those units might be overloaded at peak moments by other VMs.
We also noticed that this problem does not occur when we use local disks. So far it only happens on NFS shared storage. We have a Proxmox node with 8x1TB disks + 2x500GB SSD on ZFS and that works fine: Terminal Servers don't crash on this stand-alone server.
Current set-up:
5x Proxmox 4.2 nodes with 2 ports in LACP bond
2x SSD ZFS NFS nodes with 4 ports in ALB bond
Managed 48p gigabit switch with LACP memberships for the Proxmox nodes
Any suggestions on what we can do to prevent the Terminal Servers from freezing? Any idea why the Proxmox console does not work at the moment the VM freezes? And what can prevent a "reset" from working?
The biggest question is: is this a Proxmox/KVM issue? An NFS issue? A Windows issue? We need to know, because last time we thought it was the hardware or the Proxmox version, and after a large upgrade of all hardware and software the result is the same.