Hello everybody!
I have the following problem, and I'd like to understand whether it's something "usual" or whether I'm doing something wrong.
I have some small guests running Windows Server 2012 R2: 32 GB disk hosted on a SAN, 1 socket, 2 cores, 1 GB RAM, 2 virtual NICs. Each guest runs software that receives video streams on one virtual NIC and writes them to a SAN via iSCSI through the second virtual NIC.
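For reference, a guest's config looks roughly like this (the VMID, name, MAC addresses and storage name below are made up; the rest matches the specs above):

    # qm config 101
    bootdisk: virtio0
    cores: 2
    memory: 1024
    name: video-guest-01
    net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
    net1: virtio=DE:AD:BE:EF:00:02,bridge=vmbr1
    ostype: win8
    sockets: 1
    virtio0: san-lvm:vm-101-disk-1,size=32G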
The first guests I created ran smoothly.
As I created more guests, the new machines were slower (while the first ones still ran fine). The physical machines always show less than 20% CPU usage (of real cores) and plenty of free RAM.
The number of VMs running on a physical machine doesn't seem to matter much: the problem appears whether there are 3 guests running or 10.
I realized the new machines were slow probably because of disk IO (on C:, not the network-attached disk). Indeed, they are already very slow at boot, and again every time something reads from or writes to C:.
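To check whether it really is disk latency, I watch the extended device statistics on the node while a slow guest boots (a minimal sketch, assuming the sysstat package is installed):

    iostat -x 5
    # watch await (avg wait per IO, in ms) and %util
    # for the device backing the VM volume group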
IO delay stays below 2% during normal operation (just video handling) on each physical machine, but rises when a user logs into a guest and uses it interactively.
Finally, and this is the strange part: if a user logs into a very slow guest and uses it interactively for some time (about 10 minutes) through the PROXMOX console, the guest becomes more and more responsive and eventually runs very smoothly. However, it then becomes slow again after an apparently random time (anywhere from a few hours to days).
Note: if I reboot a very slow VM from inside Windows, the reboot takes many minutes. If I instead use the machine interactively for a while first, until it becomes fast, then the reboot is also very fast...
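To see what a single slow guest is doing on disk, its KVM process can be watched on the node; a sketch, with 101 as a placeholder VMID:

    PID=$(cat /var/run/qemu-server/101.pid)
    pidstat -d -p "$PID" 5   # per-process kB read/written per second (from sysstat)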
The slowest machines also often experience a sort of freeze and become completely unresponsive; they have to be stopped and started again from PROXMOX. I've read on this forum about many problems with Windows 2012 freezing, and I've tried many of the proposed solutions (updating the VirtIO drivers, following the tuning guidelines, and so on), but the problem doesn't go away.
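("Stopped and started again from PROXMOX" means a hard stop, i.e. the equivalent of:)

    qm stop 101     # hard stop, VMID 101 as an example
    qm start 101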
Do you have any clue why this is happening?
Is there a way to keep the VMs from becoming so "slow" and make them "stay fast"?
Many thanks in advance!
Regards,
Adamo
Setup
PROXMOX cluster made up of 6 Dell PowerEdge R730
4 with 2 x Intel Xeon E5-2660v3
2 with 2 x Intel Xeon E5-2650v4 (added recently)
all with 64 GB RAM, 4 x 1 Gbit NICs, and 2 x 10 Gbit NICs.
On each host, the 4 x 1 Gbit NICs are bonded with LACP and used for the video and management networks.
The same holds for the 2 x 10 Gbit NICs, which are used for storage.
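The bonding is plain 802.3ad in /etc/network/interfaces, roughly like this (interface names and the address are examples from one node):

    auto bond0
    iface bond0 inet manual
        slaves eth0 eth1 eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad

    auto vmbr0
    iface vmbr0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0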
Virtual machine disks are LVM volumes on a cluster of 2 NetApp SANs (4 controllers), connected through 10 Gbit links and made visible to PROXMOX through iSCSI.
The Windows virtual machines also connect directly to other volumes on the same SAN, again through iSCSI.
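On the PVE side this is the usual iSCSI-plus-LVM pair in /etc/pve/storage.cfg, roughly as follows (portal, target, volume group and base volume are made up):

    iscsi: netapp
        portal 10.10.10.1
        target iqn.1992-08.com.netapp:sn.0123456789
        content none

    lvm: san-lvm
        vgname san_vg
        base netapp:0.0.0.scsi-360a98000aabbccdd
        shared 1
        content images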
The first 4 servers run pve-manager/4.1-1 (Linux 4.2.6-1-pve), while the last 2 run pve-manager/4.4-1 (Linux 4.4.35-1-pve).
I'm in the process of gradually upgrading all the machines to the latest available version of PVE.
Note that the problem I've described was already present when I had only 4 nodes, and it's still present even when I migrate old VMs to nodes with an upgraded version of PVE.