Hardware requirements advice for more than 100 VMs

Some thoughts on your scenario (sorry, I won't recommend specific hardware either):

* Have you thought about using Docker? You could consolidate some of your VMs into one (say, 20 Docker containers on one VM) and save much of the overhead of running complete VMs.

* Regarding stopped / frozen VMs: you probably want some kind of watchdog that triggers a restart of a VM if it has not responded for a specified amount of time. That would reduce the offline time and therefore the amount of data that has to be parsed afterwards. It could save you a lot of money and resources that you would otherwise keep in reserve just for the worst case.

* I understand that your Java software is multithreaded and can therefore use multiple cores, so I'd prefer more cores over higher clock speed. In PVE you can assign 4 or 6 cores per VM and then limit the usage to about half of that, e.g. 6 cores with a limit of 300% or perhaps 400%. That way you spread the load over more cores, assuming the VMs are not all busy at the same time.
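The watchdog idea from the list above could be sketched roughly as follows. This is only a minimal illustration, not a PVE feature: the `Watchdog` class, the heartbeat mechanism and the `restart` callback are all hypothetical names, and the callback stands in for whatever restart mechanism you actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Watchdog:
    """Restart VMs that have not reported in for `timeout_s` seconds.

    `restart` is a hypothetical callback that receives the VM id; how the
    restart is actually performed is left to the caller.
    """
    timeout_s: float
    restart: Callable[[str], None]
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, vm_id: str, now: float) -> None:
        # Record the time at which the VM last proved it was alive.
        self.last_seen[vm_id] = now

    def check(self, now: float) -> List[str]:
        # Restart every VM whose last heartbeat is older than the timeout.
        restarted = []
        for vm_id, seen in self.last_seen.items():
            if now - seen > self.timeout_s:
                self.restart(vm_id)
                # Give the restarted VM a fresh grace period.
                self.last_seen[vm_id] = now
                restarted.append(vm_id)
        return restarted
```

In practice `check` would run periodically (e.g. once a minute) with `now=time.monotonic()`, and the timeout would be chosen so that a slow but healthy VM is not restarted by mistake.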
 
I need to determine what type of CPU to use and decide whether I should go with LXC or VMs.

As I tried to explain: 100 VMs with 2 CPUs each is currently not possible in a performant way on a two-socket system. On a 4-socket system you can have that number of CPUs, but we're in six-figure territory (list prices) for the four top-of-the-line Xeons alone.
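To make the arithmetic behind this concrete, here is a small sketch. The 28 cores per socket is the top-end CPU size mentioned later in this thread; two hardware threads per core (Hyper-Threading) is an assumption, and `vcpu_budget` is just an illustrative helper name.

```python
def vcpu_budget(vms, vcpus_per_vm, sockets, cores_per_socket, threads_per_core=2):
    """Compare the requested vCPUs with the host's cores and hardware threads."""
    total_vcpus = vms * vcpus_per_vm
    physical_cores = sockets * cores_per_socket
    hw_threads = physical_cores * threads_per_core  # assumes SMT is enabled
    return {
        "total_vcpus": total_vcpus,
        "physical_cores": physical_cores,
        "hw_threads": hw_threads,
        "vcpus_per_core": total_vcpus / physical_cores,
        "vcpus_per_thread": total_vcpus / hw_threads,
    }


# 100 VMs with 2 vCPUs each, on 28-core CPUs (assumed top-end part).
two_socket = vcpu_budget(vms=100, vcpus_per_vm=2, sockets=2, cores_per_socket=28)
four_socket = vcpu_budget(vms=100, vcpus_per_vm=2, sockets=4, cores_per_socket=28)
```

Under these assumptions the two-socket host offers 56 cores (112 threads) for 200 vCPUs, so every hardware thread is oversubscribed, while the four-socket host offers 112 cores (224 threads) and can cover the 200 vCPUs without oversubscribing threads.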

You said that your VMs are almost identical. Does your software autoscale across the number of machines, or how is your workload spread across the nodes? In every such setup I've seen, there is a monitor that kills machines that take too long to complete a task and reschedules the task to another node, assuming you have one global work queue.
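As a rough illustration of that pattern (one global work queue plus a monitor that reclaims overdue tasks), consider the sketch below. It is a simplified in-memory model, not how any particular scheduler works; `WorkQueue`, `claim` and `reap` are hypothetical names, and actually killing the stuck machine is left to the caller.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class WorkQueue:
    """One global queue; tasks that run too long go back to the queue."""
    task_timeout_s: float
    pending: Deque[str] = field(default_factory=deque)
    # task -> (node that claimed it, time it was claimed)
    running: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def submit(self, task: str) -> None:
        self.pending.append(task)

    def claim(self, node: str, now: float) -> Optional[str]:
        # A node asks for work; hand out the oldest pending task, if any.
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[task] = (node, now)
        return task

    def complete(self, task: str) -> None:
        self.running.pop(task, None)

    def reap(self, now: float) -> List[str]:
        # Requeue overdue tasks and report which nodes should be killed.
        stuck_nodes = []
        for task, (node, started) in list(self.running.items()):
            if now - started > self.task_timeout_s:
                del self.running[task]
                self.pending.append(task)
                stuck_nodes.append(node)
        return stuck_nodes
```

The monitor would call `reap` periodically, restart or kill the nodes it returns, and any healthy node picks up the requeued task on its next `claim`.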

The best approach for such a setup is horizontal scaling, e.g. with Ceph, spreading your workload across multiple nodes. If you need more capacity, just add more nodes, and your system can grow with your business/problem size. This is the IaaS point of view. @puldi is also right with the Docker view, which is the more PaaS-like view of your problem.
 
8 machines (PCs). Proxmox is installed on each machine, each Proxmox has only one node, and each node runs 16 VMs. Every VM except the first one is a clone; the only differences between the VMs are the IP, the MAC and the Java config file.

I'm not sure what you mean by autoscaling, but let's say Java works like this: I insert data into the database about a device that will connect to the Java app. Java periodically checks (refreshes) for new parameters and fetches them. Based on those parameters, the device sends data to the Java app, and Java then sends the results to a remote database. Java scales when I add a new device, and that is manual work, not automatic; it is based on client requests.

So, the Java app has a limit on the number of devices, set by the developer because of this overload problem: the more devices (connections) Java has, the higher the load on CPU, RAM and SSD.

Tell me if you need more info
 
Thank you. As others have mentioned before, consolidating everything into one box is actually worse than what you have now. I'd go with 3 nodes, Ceph, and the HA route. This gives better availability and is easier to expand in the future. Depending on your current infrastructure, the price could be similar, because you don't need the biggest available CPU (28c): you can go with several CPUs that have fewer cores and still be faster, because base GHz and turbo speed go down as the core count goes up. You scale by the number of nodes, so you can maximise the per-core CPU speed.

I'd get quotes for a single big server (or two, if you want to be able to switch over), and for 3 nodes with Ceph, and see which works out better for you financially.
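When comparing the quotes, a quick sizing check like the one below may help. It is only a back-of-the-envelope sketch: `nodes_needed` is an illustrative helper, the overcommit factor and threads per node are assumptions you would replace with your own numbers, and it ignores RAM, storage and the CPU that Ceph itself consumes.

```python
import math


def nodes_needed(total_vcpus, threads_per_node, overcommit=1.0, spare_nodes=1):
    """Nodes required to host all vCPUs, plus spares that may fail (HA)."""
    if threads_per_node <= 0 or overcommit <= 0:
        raise ValueError("threads_per_node and overcommit must be positive")
    # vCPUs one node can carry at the accepted overcommit ratio.
    usable_vcpus_per_node = threads_per_node * overcommit
    return math.ceil(total_vcpus / usable_vcpus_per_node) + spare_nodes


# Illustrative inputs only: 8 x 16 = 128 VMs with 2 vCPUs each,
# hypothetical nodes with 64 hardware threads, 2:1 overcommit accepted.
example = nodes_needed(total_vcpus=256, threads_per_node=64, overcommit=2.0)
```

The `spare_nodes` term is what the HA route adds: the cluster is sized so that the remaining nodes can still carry the whole load when one node is down.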
 
