For my company I have been using Proxmox through a supplier for a little over a year now, and I am not too happy. I hope there are some people here who can help me find a solution. Please bear with me as I write it all down and try to give a good overview without making it too long.
We have supplied remote desktop services for over 10 years and used VMware for all those years. For the last 4 years we ran on a 4-node Dell system with a single SAN underneath, on a 2x 10 Gb network in the same DC rack, and we never had performance issues. As the platform got fuller we started looking for a replacement and found a supplier of Proxmox, which sounded great.
So, as advised, we bought 3 compute nodes (80 x Xeon Gold 6138 @ 2.00GHz, 2 sockets / 768 GiB mem) and 3 storage nodes (I don't have the hardware details) populated for about 80% with SSDs only. That is ~10 SSDs per node, where the old SAN had only spinning disks. Everything is spread over 3 datacentres with a 10 Gb network connecting the nodes. The nodes are now running VE 6.3.2 and had one upgrade from some version of 5.x somewhere in the middle of 2020, a few weeks before any problems appeared.
We started moving our clients from VMware to Proxmox in February 2020, mainly by reinstalling their systems completely so that no (VMware) drivers were left behind. After moving about 90% we started to have performance issues. On working days (mostly Thursdays) the system would get slow starting around 12:00, and from ~15:00 it would be too slow to work on at all and we would get complaints from everyone. There was no increase in the number of users at these times compared to the start of the day.
We started looking for problems together with our supplier. We (from within Windows) and they (in Windows and Proxmox) measured disk performance, network performance, Active Directory and DNS (on the AD), and never found anything conclusive. We had an external person with 15+ years of Linux experience and a small Proxmox cluster of his own look at the system, and he told us: "If I look at what I measure, you have no problems at all, but looking at how it is to work on the system, you clearly have a problem." After adding NVMe to all compute nodes with no real improvement, around the end of October, we told the supplier we would go back to VMware, as we had started to lose clients and were no closer to a solution (3 weeks of little sleep). This made them put in extra effort, and they said they had found some problem with the network under heavy CPU load in combination with Windows, but they were not sure whether this was a real problem as the CPUs were never at 100%. They added a small extra compute node to the cluster and the system started performing OK again.
We finished moving the last clients and added some new ones. After that, before running into the same problem again, we got 2 extra nodes (48 x AMD EPYC 7272 12-core, 2 sockets / 768 GiB mem) and added them to the cluster. Then Christmas and so on started.
Right now we can continue, and from talking to other companies we know we made the right choice with Proxmox. But we are not relaxed, because we have no idea when we will reach the same point and everything starts to crumble again, and we don't know what to monitor and watch in order to make predictions and decisions.
As said, we measured a lot, even in the middle of the night, and never found anything that was always slow or always high. The network performance of the Windows VMs is always about 1/5 of that of the Linux machines (measured with iPerf). Between nodes it is slower for both, but roughly the same 1/5 difference applies. That is still the case now that everything runs fine, and it did not change when we tried several vNIC types, drivers and driver versions during the issues.
The fact that adding a node to the system helped makes me think it is not the network between the nodes, but I have no insight into that.
The nodes themselves looked fine through the eyes of the supplier (and to me, with too little experience), as their monitoring didn't give a warning. And they only added the node after 3 weeks of problems. So there were no direct indicators that the system was full, IMO.
We have a subscription, but I don't know how far the supplier made use of it, as a lot of the time I got the answer that we would not get help with our "Windows" problems.
The last thing we did, at the end of the year, was buy a "home" machine with NVMe and install Proxmox on it. That machine is for sure a lot faster than my professional systems. A Windows AD with full Exchange boots in seconds, and a login with the management tools loaded takes ~3 seconds. I don't even get that with a fresh Windows install on the new, empty nodes (using the Ceph storage nodes underneath, not NVMe). I understand it is not a real comparison, but it gives me the feeling there is room for (big) improvements somewhere.
My goal is to know when the system is getting close to full capacity. I understand that I will have to answer a lot of questions, which I will have to relay to the supplier to check whether those things are already monitored.
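To make the question a bit more concrete, here is a minimal sketch of the kind of check I have in mind. It assumes that the share of CPU time a node spends in iowait (or other states) between two samples is a useful signal to trend over time, which is only my assumption and not something we have confirmed as the cause. The function names are hypothetical; the only real interface used is the aggregate "cpu" line of the standard Linux /proc/stat file.

```python
# Sketch only: compares two samples of the aggregate "cpu" line of /proc/stat
# and reports what percentage of the elapsed CPU time went to each state.
# All function names are made up for illustration.

CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn a line like 'cpu  100 0 50 800 50 0 0 0 0 0' into named tick counters."""
    values = line.split()[1:1 + len(CPU_FIELDS)]
    return dict(zip(CPU_FIELDS, (int(v) for v in values)))


def cpu_percentages(before, after):
    """Percentage of CPU time per state between two samples."""
    deltas = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed between the samples (or counters reset).
        return {name: 0.0 for name in CPU_FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate counters on a Linux host."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

On a node this would be used by calling read_cpu_sample() twice, some seconds apart, and passing both results to cpu_percentages(). Which values would count as "too high" is exactly what I don't know.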
The other thing is that I would like to find out why my home system is so much faster.
Thanks for reading. I will try to reply to your questions as soon as I can while we are in production.