Okay, so the overall layout will be something like this:
Code:
            ┌──────────┐
            │  Room 3  │
            │ ┌──────┐ │
     ┌──────┤ │Node 5│ ├──────┐
     │      │ └──────┘ │      │
     │      │          │      │
     │      └──────────┘      │
┌────┴─────┐            ┌─────┴────┐
│  Room 1  │            │  Room 2  │
│ ┌──────┐ │            │ ┌──────┐ │
│ │Node 1│ │            │ │Node 3│ │
│ └──────┘ │            │ └──────┘ │
│          ├────────────┤          │
│ ┌──────┐ │            │ ┌──────┐ │
│ │Node 2│ │            │ │Node 4│ │
│ └──────┘ │            │ └──────┘ │
│   ...    │            │   ...    │
└──────────┘            └──────────┘
Whether you use room, site or datacenter as the separation level in the Crush map doesn't really matter, as long as you are consistent.
To create the hierarchy in the Crush map, first create the rooms:
Code:
ceph osd crush add-bucket {room name} room
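For the example layout above, that could be (room names are just examples; the tie-breaker node usually does not hold any OSDs, so it does not need a room bucket):
Code:
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room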
Then move the rooms underneath the root (default) of the Crush map:
Code:
ceph osd crush move {room name} root=default
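Continuing with the example room names:
Code:
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default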
And last, move the nodes into their respective rooms:
Code:
ceph osd crush move {node} room={room}
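Assuming the hostnames match the diagram (node1 and node2 in room 1, node3 and node4 in room 2):
Code:
ceph osd crush move node1 room=room1
ceph osd crush move node2 room=room1
ceph osd crush move node3 room=room2
ceph osd crush move node4 room=room2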
After that, you should see the hierarchy in the Ceph->OSD panel, or in the output of the following command:
Code:
ceph osd df tree
Then edit the Crush map (the Stretch cluster documentation covers how). You can either use the Crush rule shown there, or the following, which also works:
Code:
rule replicated_room_host {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
What both rules do is make sure that 2 replicas are stored per site/room.
The pools will then need a size/min_size of 4/2.
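The rough workflow to get the rule into the cluster and point a pool at it could look like this (file and pool names are just examples; the stretch cluster documentation describes the Crush map editing in more detail):
Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add the rule to crushmap.txt with a text editor, then recompile and inject it
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

ceph osd pool set {pool name} crush_rule replicated_room_host
ceph osd pool set {pool name} size 4
ceph osd pool set {pool name} min_size 2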
Now, Proxmox VE specifics: we don't want any VMs to end up on the tie-breaker node. There are a few ways to accomplish that. You could limit the storages to the non-tie-breaker nodes. This way, even if HA were to place a VM on the tie-breaker node, it wouldn't be able to start there, and HA would relocate it to another node.
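As a sketch, assuming an RBD storage called ceph-vms on a pool vm_pool and hostnames node1 to node4, the entry in /etc/pve/storage.cfg could look like this (the node restriction can also be set in the GUI under Datacenter -> Storage):
Code:
rbd: ceph-vms
        pool vm_pool
        content images,rootdir
        nodes node1,node2,node3,node4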
Additionally, you can make use of HA groups to "restrict" the VMs to the nodes in each room/site. By assigning priorities, you can control on which half of the cluster the VMs should preferably run.
So you could have 2 HA groups, each with higher priorities set for the nodes in one of the rooms.
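A minimal sketch with the example node names (group names are made up; higher numbers mean higher priority):
Code:
ha-manager groupadd prefer_room1 --nodes "node1:2,node2:2,node3:1,node4:1"
ha-manager groupadd prefer_room2 --nodes "node1:1,node2:1,node3:2,node4:2"
# assign a VM, e.g. VM 100, to one of the groups
ha-manager add vm:100 --group prefer_room1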
After you have set it up, you can confirm that the replicas are split between the rooms by investigating the placement groups.
Code:
ceph pg ls
This will list all PGs and the OSDs they are on. You can then verify that the OSDs in use are as expected: 2 OSDs from room 1 and 2 from room 2.
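To narrow it down to one pool and to map OSD IDs back to their host (and therefore room), something like this should work (pool name is again just an example):
Code:
ceph pg ls-by-pool {pool name}
ceph osd find {osd id}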
And before you take it into production, test the different failure scenarios, including failures of the network connection between the two locations in all possible variants, so you get a good understanding of how the Proxmox VE + Ceph cluster behaves. Some half-broken network constellations can, for example, show up as weird behavior in Ceph.