Sizing question, dual data center

luphi

Hey all,

I have to size a PVE/Ceph environment for two data centers.
We need a new home for roughly 300 small VMs (4 cores, 4 GB memory, 100-200 GB storage).
I estimate half a year until all 300 VMs are migrated and have planned for 100% growth over the next three years.
Storage bandwidth per VM should not be less than that of a single local spinner.
One more thing: I have to rely on HP hardware.
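
To make those requirements concrete, here is my rough back-of-the-envelope calculation (just an illustrative Python sketch; the worst-case 200 GB per VM and 3x replication are my assumptions):

```python
# Rough capacity target for the whole fleet, before deciding how it is split
# across the two DCs. 200 GB/VM and 3 replicas are assumptions, not fixed values.
vms_now, growth_factor = 300, 2            # 100% growth over three years
vms_target = vms_now * growth_factor       # 600 VMs

vcpus     = vms_target * 4                 # 2400 vCPUs (before CPU overcommit)
ram_gb    = vms_target * 4                 # 2400 GB of guest memory
usable_tb = vms_target * 200 / 1000        # 120 TB of usable storage
raw_tb    = usable_tb * 3                  # 360 TB raw with 3x replication

print(vcpus, ram_gb, usable_tb, raw_tb)
```

(For comparison, 5 OSD nodes with 24x 2.4 TB each come to roughly 288 TB raw per DC, i.e. about 96 TB usable at 3x replication, before backups and free-space headroom.)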

Based on these requirements, I sized the following three types of servers:

PVE
------
DL360Gen10
2x Xeon 6130 (16 cores)
512 GB memory
2x 240 GB SATA M.2 mixed r/w
2x 10GE Ethernet

OSD
-------
DL380Gen10
1x Xeon 6130 (16 cores)
64 GB Memory
2x 240 GB SATA M.2 mixed r/w
24x 2.4 TB SAS SFF HDDs
2x NVMe SSD
2x 10GE Ethernet
BBU

MON/MDS
---------------
DL20Gen10
1x Xeon E-2124
16GB Memory
2x 240 GB SATA M.2 mixed r/w

For each DC, I would take:
4x PVE, 5x OSD, 3x MON/MDS

Each DC gets its own independent cluster:
One Ceph pool to store the VM disks.
One CephFS to store the backups from the other DC.

In the worst-case scenario of a complete DC outage, I would have to restore the missing VMs manually (or via an API script).
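
To illustrate what I mean by an API script, something along these lines (just a sketch using the third-party proxmoxer Python client; node and storage names are placeholders, and a real script would pick only the newest archive per VM):

```python
# Sketch only: restore the VM backups found on the CephFS backup storage into
# the surviving cluster. Node and storage names below are placeholders.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve-dc2-01.example.com", user="root@pam",
                 password="secret", verify_ssl=False)

node = "pve-dc2-01"
backup_storage = "cephfs-dc1-backup"   # CephFS holding DC1's vzdump archives
target_storage = "ceph-vm"             # local RBD pool for the restored disks

for item in pve.nodes(node).storage(backup_storage).content.get(content="backup"):
    # item["volid"] is the vzdump archive, item["vmid"] the original VM ID.
    pve.nodes(node).qemu.create(vmid=item["vmid"],
                                archive=item["volid"],
                                storage=target_storage,
                                unique=1)   # regenerate MAC addresses
```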

My open topics so far are:
1. Will Proxmox support the hardware (the DL20 comes with an S100i controller)? I hope it will work in HBA mode.
2. Which caching strategy shall I use?
3. Since both clusters are independent, how do I avoid duplicate VM IDs in case of a restore?

Any comments, ideas, recommendations, or concerns about this approach or my open questions are highly appreciated.

Cheers,
Martin
 
Hi,

Will Proxmox support the hardware (the DL20 comes with an S100i controller)? I hope it will work in HBA mode.
Proxmox only supports HBAs; if your controller supports an HBA mode, it is fine.
But keep in mind that JBOD is not the same as HBA.

2. Which caching strategy shall I use?
It depends on your applications.

3. Since both clusters are independent, how do I avoid duplicate VM IDs in case of a restore?
If they are independent, there is no need to take care of that.
Generally, when you restore a VM from a backup, you can use an alternative VMID.
The main problem I see is determining which backup is the current (newest) state.
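
One simple convention (just a suggestion, nothing PVE enforces) is to reserve a separate VMID range per DC, so a guest restored on the other side never collides with a locally created ID:

```python
# Hypothetical convention: non-overlapping VMID ranges per DC. A guest restored
# in the other DC can keep its original ID without clashing with local guests.
VMID_RANGES = {"dc1": range(1000, 5000), "dc2": range(5000, 9000)}

def next_free_vmid(dc, used_ids):
    """Return the first unused VMID inside the DC's reserved range."""
    return next(vmid for vmid in VMID_RANGES[dc] if vmid not in used_ids)

print(next_free_vmid("dc1", {1000, 1001}))   # -> 1002
```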

Regarding your HW setup:
I would recommend using a 40 Gbit network for the Ceph nodes (MON/MGR and OSD), because the latency of 10 Gbit is too high for a small cluster like this.
With 10 Gbit you will get a bottleneck.

I guess the NVMe is used for the WAL/DB?
If so, you have to use something in the class of an Intel DC P4800X to accelerate 12 OSDs per NVMe.
I personally would consider using SSDs instead of HDDs,
because the price of these HDDs is very close to that of SSDs, and with SSDs you get more IOPS and lower latency.
Also, with SSDs you can skip the NVMe devices.
 
Hello Wolfgang,

thank you for your quick assessment.

But keep in mind that JBOD is not the same as HBA.
Hmm, I thought an HBA just supports a JBOD configuration.
HP SmartArray controllers support RAID and HBA mode.

The main problem I see is determining which backup is the current (newest) state.
I'm not sure what you mean. In DC1, I would like to configure a daily backup job of all VMs using the CephFS storage of DC2.
In case of a disaster, I can restore them in DC2, losing at most the last day's data, which is acceptable in my case.

I would recommend using a 40 Gbit network for the Ceph nodes (MON/MGR and OSD), because the latency of 10 Gbit is too high for a small cluster like this.
With 10 Gbit you will get a bottleneck.

I started the sizing from the network perspective. My plan was to use 2x 10 GbE in an LACP configuration. With six OSD nodes I would then expect 120 Gbit/s of throughput on the storage side. If I sum up the PVE side, I get 80 Gbit/s. So if the network becomes the bottleneck, I expect it to be on the client side. Please correct me if I'm wrong here. I don't have your experience with such setups, but 80 Gbit/s sounds like a good starting point to me, or is the latency a real show stopper here? Do the MON/MGR nodes also need 10 GbE or 40 GbE? I thought they are not that heavily loaded.

I guess the NVMe is used for the WAL/DB?
Yes, that was my initial idea. I calculated as follows: if a single spinner can write 150 MB/s, then 150 MB/s * 24 disks / 3 replicas = 1200 MB/s, which, sped up by the NVMes, should be similar to the 20 Gbit/s network bandwidth. Is that too naive?
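
Written out, my arithmetic looks like this (pure back-of-the-envelope; protocol overhead and replication traffic on the cluster network are ignored on purpose):

```python
# Back-of-the-envelope only: raw link capacity vs. what the spinners can absorb.
link_gbit, links_per_node = 10, 2                  # 2x 10 GbE LACP per node

storage_side_gbit = 6 * links_per_node * link_gbit # six OSD nodes -> 120 Gbit/s
client_side_gbit  = 4 * links_per_node * link_gbit # four PVE nodes -> 80 Gbit/s

spinner_mb_s, disks, replicas = 150, 24, 3
node_write_mb_s = spinner_mb_s * disks / replicas  # 1200 MB/s per OSD node
node_write_gbit = node_write_mb_s * 8 / 1000       # ~9.6 Gbit/s, below the node's 20 Gbit/s

print(storage_side_gbit, client_side_gbit, node_write_mb_s, node_write_gbit)
```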

If so, you have to use something in the class of an Intel DC P4800X to accelerate 12 OSDs per NVMe.
I personally would consider using SSDs instead of HDDs,
because the price of these HDDs is very close to that of SSDs, and with SSDs you get more IOPS and lower latency.
Also, with SSDs you can skip the NVMe devices.
Currently I'm still waiting for a quote for the above setup, so I have no idea which NVMes will actually be offered. Based on an older quote from a month ago, the price for an SSD is three times higher than the price for a spinner (per TB). I think that justifies the NVMes. Regarding IOPS and latency: won't those be covered by the NVMes?

I hope my assumptions are not too far from reality. Is the latency difference between 10 GbE and 40 GbE interfaces really that noticeable?
 
Hi,

For any dual-DC setup like yours, one of the most important tasks is moving the backups from DC1 to DC2 and in reverse. I guess that you will copy your backups ... as files?

And during this process (outside of working hours), what happens if you cannot finish this task by morning? And what if the next day one of your DCs is offline? Or maybe your Ceph is broken the next day?

What services will be hosted on these 300 VMs? Do all of these VMs run identical services?
 
Hello guletz,

I guess that you will copy your backups ... as files?
Yes, I want to use the Proxmox internal backup solution and use the snapshot mode.

And during this process (outside of working hours)
There are no off-work hours. But it's okay to have a short break for the snapshot, since they are not all doing it at the same time.

And what if the next day one of your DCs is offline? Or maybe your Ceph is broken the next day?
That's why I don't want to spread the cluster across both DCs. I want to have two separate Ceph clusters.
Each cluster has two pools: one RBD pool to host the VMs and one CephFS to store the backups from the other DC.
Each PVE cluster will mount both CephFS instances, the remote one to store backups and the local one in case of a restore.

What services will be hosted on these 300 VMs? Do all of these VMs run identical services?
Three or four different kinds of appliances.

Cheers,
Martin
 
I would recommend using a 40 Gbit network for the Ceph nodes (MON/MGR and OSD), because the latency of 10 Gbit is too high for a small cluster like this.
With 10 Gbit you will get a bottleneck.

Just wanted to mention that 40 Gbit and 10 Gbit connections have identical latency, because 40 Gbit is 4x 10 Gbit multiplexed... if you want lower latency, you'd need to move to 25 Gbit Ethernet.
 
I started the sizing from the network perspective. My plan was to use 2x 10 GbE in an LACP configuration. With six OSD nodes I would then expect 120 Gbit/s of throughput on the storage side. If I sum up the PVE side, I get 80 Gbit/s. So if the network becomes the bottleneck, I expect it to be on the client side. Please correct me if I'm wrong here. I don't have your experience with such setups, but 80 Gbit/s sounds like a good starting point to me, or is the latency a real show stopper here? Do the MON/MGR nodes also need 10 GbE or 40 GbE? I thought they are not that heavily loaded.

Yes, that was my initial idea. I calculated as follows: if a single spinner can write 150 MB/s, then 150 MB/s * 24 disks / 3 replicas = 1200 MB/s, which, sped up by the NVMes, should be similar to the 20 Gbit/s network bandwidth. Is that too naive?
Hi,
Unfortunately, with HDD OSDs I assume you will not reach your calculated values - the latencies add up and the performance isn't that high...

What kind of disk do you want to use?

Some time ago I had a Ceph cluster which started with 4 OSD nodes (each with 12x 4 TB spinners plus a journal SSD, connected via 10 Gbit SFP+ Ethernet). With each added node the performance got better (the cluster was at 8 OSD nodes when I left the company).
But I never reached 1200 MB/s write (or read) speed!! This also depends on the access pattern - with many simultaneous accesses the values got better. In our case a few VMs did almost all of the IO (fileserver).

Udo
 
Hmm, I thought an HBA just supports a JBOD configuration.
HP SmartArray controllers support RAID and HBA mode.
JBOD can use the memory (cache) on the RAID card, which can be a potential problem with Ceph.
HBA mode uses the disks directly (passthrough), so there is no RAID-card optimization.

I'm not sure what you mean. In DC1, I would like to configure a daily backup job of all VMs using the CephFS storage of DC2.
In case of a disaster, I can restore them in DC2, losing at most the last day's data, which is acceptable in my case.
I would recommend using rbd-mirror to sync the guests directly and create the vzdump on DC2 later.
If you copy the vzdump, you always have to copy the whole image; when you use rbd-mirror, you only sync the diff.
Also, in a failover case the guests can start instantly, because you don't need to extract them.
See http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/
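
Very roughly, the per-image workflow from those docs looks like the following sketch (pool, image and peer names are placeholders; it is wrapped in Python only for illustration, in practice these are plain rbd commands):

```python
# Sketch of journal-based per-image mirroring as described in the mimic docs.
# "vm-pool", "vm-101-disk-0" and the peer name are placeholders.
import subprocess

def rbd(*args):
    subprocess.run(["rbd", *args], check=True)

# On both clusters: enable mirroring on the VM pool in per-image mode.
rbd("mirror", "pool", "enable", "vm-pool", "image")

# On DC1 (the primary): journaling is required for journal-based mirroring.
rbd("feature", "enable", "vm-pool/vm-101-disk-0", "journaling")
rbd("mirror", "image", "enable", "vm-pool/vm-101-disk-0")

# On DC2: add DC1 as a peer and run the rbd-mirror daemon there, e.g.
#   rbd mirror pool peer add vm-pool client.rbd-mirror-peer@dc1
```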

I don't have your experience with such setups, but 80 Gbit/s sounds like a good starting point to me, or is the latency a real show stopper here? Do the MON/MGR nodes also need 10 GbE or 40 GbE?
The main problem in the Ceph network is latency. This applies to MON/MGR and OSD alike.
You have to consider that for every write, the data has to cross the network 4 times inside the cluster, and the client has to cross it 2 more times.
The latency problem also affects the disks, which is why SSDs are better than spinners.
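
As a purely illustrative example of why this matters (the per-crossing latency is an assumed value, not a measurement):

```python
# Illustrative only: network time per small replicated write, assuming an
# example one-way latency of 0.05 ms per network crossing.
cluster_crossings, client_crossings = 4, 2   # as described above
per_crossing_ms = 0.05                       # assumption, varies with NIC/switch

network_ms = (cluster_crossings + client_crossings) * per_crossing_ms   # 0.3 ms
qd1_iops_ceiling = 1000 / network_ms                                    # ~3333 IOPS

print(network_ms, qd1_iops_ceiling)   # network time alone caps a queue-depth-1 writer
```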

Yes, that was my initial idea. I calculated as follows: if a single spinner can write 150 MB/s, then 150 MB/s * 24 disks / 3 replicas = 1200 MB/s, which, sped up by the NVMes, should be similar to the 20 Gbit/s network bandwidth. Is that too naive?
It does not work this way, because you will have sync writes, and only high-end NVMe disks handle those well.
This article is no longer current, but it gives you an idea of how slow NVMe devices can be in sync mode.
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

I hope my assumptions are not too far from reality. Is the latency difference between 10 GbE and 40 GbE interfaces really that noticeable?
See our ceph benchmark paper.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
There you can see the difference between a 10 Gbit and a 100 Gbit network with the same setup.
We do not have a 40 Gbit network, but I read that in terms of latency a 40 Gbit network is closer to 100 Gbit than to 10 Gbit.
 
JBOD can use the memory (cache) on the RAID card, which can be a potential problem with Ceph.
HBA mode uses the disks directly (passthrough), so there is no RAID-card optimization.
Thank you for the clarification.

I would recommend using rbd-mirror to sync the guests directly and create the vzdump on DC2 later.
If you copy the vzdump, you always have to copy the whole image; when you use rbd-mirror, you only sync the diff.
Also, in a failover case the guests can start instantly, because you don't need to extract them.
See http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/
That seems to be a good approach. Can I simply copy over the VM config files and do the vzdump on the synced images?

I just read https://forum.proxmox.com/threads/rbd-mirror-support.33298/
Not that easy if the Ceph clusters are maintained by Proxmox. I will take a deeper look and do some testing...

The main problem in the Ceph network is latency. This applies to MON/MGR and OSD alike.
You have to consider that for every write, the data has to cross the network 4 times inside the cluster, and the client has to cross it 2 more times.
The latency problem also affects the disks, which is why SSDs are better than spinners.
I'm afraid I have to rely on spinners :-( (hopefully I can get some of the P4800X drives).

Another question:
How can I make sure that VMs will not be deployed to the Ceph nodes?
Should I build separate clusters for PVE and Ceph, or is there a smarter way?

Cheers,
Martin
 
Can I simply copy over the VM config files and do the vzdump on the synced images?
Yes

I will take a deeper look and do some testing...
Yes you should ;-)

How can I make sure that VMs will not be deployed to the Ceph nodes?
Should I build separate clusters for PVE and Ceph, or is there a smarter way?
You should really use two independent clusters: one for Ceph and one for the guests.
 
OK, thanks to everyone for your input, especially Wolfgang. It's really appreciated.
Hopefully I'll get the quotes soon.
Keep you posted.

Cheers,
Martin
 
