Hello everyone,
I'm reaching out to the community for some pointers on an educational project I'm working on.
Description: I'm building a cyber security range, i.e. a virtual environment where students can practice various security skills, the most relevant one here being attack & defense. In short, this means I need to be able to quickly spin up anywhere between 10 and 50 linked clones at a time. These clones come from several different base templates, either Linux or Windows. The VMs themselves are short-lived, typically lasting from a couple of hours to a couple of days.
The setup that I've imagined for this is the following:
Host configuration (3-5 hosts, depending on what price I get from my hardware vendor):
- CPU: 1 x AMD Epyc 7542 (32c/64t)
- RAM: 1TB
- storage:
  - 2 x M.2 (RAID1) for the OS
  - 8-10 x 1 TB SATA-3 SSDs in JBOD/HBA/IT mode for Ceph OSDs
- network:
  - 2 x 10/25 Gbps NICs for Ceph replication
  - 2 x 10/25 Gbps NICs for VM traffic
For networking gear I'm thinking of going with 2 x 10/25 Gbps switches, probably Cisco Nexus, but I'm open to other vendors as well.
And now come the questions:
1. I'm thinking of building a hyper-converged Ceph cluster out of all of these boxes. From everyone's experience, is this feasible? Keep in mind that these are not "production" VMs, so a *slight* performance penalty is understood and accepted.
2. Since budget is always a constraint, I'm in favour of more nodes (each with less RAM and/or fewer or smaller SSDs) rather than three big ones, both from Ceph's point of view and because the load will be spread across more nodes. Is this a good idea?
3. Is there any limitation on the number of VMs I can provision at the same time? I've done some testing with a small 3-node cluster and occasionally I would get an error back: `TASK ERROR: clone failed: cfs-lock 'storage-vm-storage' error: got lock request timeout`. Using Ansible's retry capabilities I was able to work around it, but I'd still like to know from everyone's experience whether there are limitations here (the retry workaround is sketched after question 4 below).
4. Last but not least, on some occasions when deleting VMs, their cloud-init drives get "left behind" on the Ceph storage; I've only noticed this when deleting 20-30 VMs at a time (the ad-hoc cleanup I end up doing is also sketched below). For both points (3) and (4) I've been using Proxmox VE 7.4-16.
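For reference on question (3), the retry workaround looks roughly like this. It's a simplified sketch using the community.general.proxmox_kvm module; the template name, VMIDs, variables and credentials are placeholders, not my real inventory:

```yaml
# Simplified sketch of the linked-clone task with retries around the
# cfs-lock timeout. Template name, VMIDs and credentials are placeholders.
- name: Linked-clone a range VM from its base template
  community.general.proxmox_kvm:
    api_host: "{{ pve_api_host }}"
    api_user: "{{ pve_api_user }}"
    api_token_id: "{{ pve_api_token_id }}"
    api_token_secret: "{{ pve_api_token_secret }}"
    node: "{{ pve_node }}"
    clone: linux-base-template      # name of the base template to clone
    vmid: 9000                      # VMID of the base template
    newid: "{{ item.vmid }}"        # VMID of the new linked clone
    name: "{{ item.name }}"
    full: false                     # linked clone instead of a full copy
    timeout: 300                    # give the clone task more headroom
  loop: "{{ range_vms }}"           # list of {vmid, name} dicts to create
  register: clone_result
  retries: 5                        # retry when the storage lock times out
  delay: 10
  until: clone_result is succeeded
```

With the retries and delay in place, the occasional lock timeout just costs a bit of extra time instead of failing the whole play.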
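And for question (4), this is roughly the ad-hoc cleanup I end up running after a bulk delete. The pool name and the deleted_vmids list are placeholders, and it assumes the leftover RBD images follow the usual vm-<vmid>-cloudinit naming:

```yaml
# Rough cleanup sketch: remove cloud-init disks left behind on the Ceph
# pool after a bulk VM delete. Pool name and VMID list are placeholders.
- name: Remove orphaned cloud-init disks from the Ceph pool
  ansible.builtin.command: "rbd -p vm-storage rm vm-{{ item }}-cloudinit"
  loop: "{{ deleted_vmids }}"       # VMIDs of the VMs that were just deleted
  failed_when: false                # the image may already be gone
```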
Thanks in advance!