Homelab migration to Proxmox

trmg

Hello,

I am new to Proxmox and am looking to migrate my home lab over. I'm currently running a hodgepodged-together 3-node libvirt/KVM/Gluster stack. I'd like to move this to Proxmox using Ceph. However, I'd need to juggle the migration around existing hardware. Here's what I'm thinking of doing:

1) Break the existing cluster and install Proxmox on one box.
2) Migrate/rebuild all VMs over to the Proxmox host.
3) Install Proxmox on the remaining nodes.
4) Join them all together as one happy Proxmox cluster with Ceph backed storage.

Is this possible? I guess the main question is: can I start a Ceph storage pool on a single box and then grow it to the remaining two later?

Also, if this is documented/detailed somewhere that I have not found, I apologize in advance!

Thanks!
 
Your plan is pretty much okay.

A few things though:
If you join a node to a cluster, it needs to be empty -> create the cluster on the first node, the one you will initially migrate the VMs to, and then join the others.
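
For reference, that order maps to two pvecm commands; the cluster name and IP below are placeholders:

    # on the first node (the one holding the migrated VMs)
    pvecm create homelab

    # on each remaining node, once it has a fresh, empty Proxmox install
    # (192.168.1.10 = IP of the first node)
    pvecm add 192.168.1.10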

While it is technically possible to run Ceph on a single node, it is not simple: you need to change quite a few things to make it work, as it goes against the normal use case of Ceph. I would recommend using another (temporary) storage for the migration, on which you can place the VMs first; once you have set up the Ceph cluster, you can move the VM disks to the Ceph storage.
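
Once the Ceph pool exists, moving each disk off the temporary storage is one command per disk (also available as "Move disk" in the GUI); the VM ID, disk name, and storage ID here are just examples:

    # move VM 100's scsi0 disk onto the Ceph-backed storage and drop the old copy
    qm move_disk 100 scsi0 ceph-vm --delete 1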

And as always with things like this, make sure to have backups of your VM disks around (and the basic configs) should anything go wrong :)
 
Gotcha. The temporary storage was something I was hoping to avoid; however, I understand why it's recommended. So, this got me thinking and I *think* I have a plan.

Since with Ceph I'd be presenting the disks to it raw, and assuming the hardware will support this...

I can take two of the drives off the first node and put them in a traditional RAID 1, and have the controller act as an HBA for the remaining drives. Get Proxmox set up and use the RAID 1 array as temporary VM storage. Once everything is migrated over, get Proxmox installed on the remaining two nodes and get everything cluster-fied. Get the remaining drives on the first node plus the drives of the 2nd and 3rd nodes Ceph-ified. Move the virtual disks over to the Ceph pool. Once all virtual disks have been moved and everything seems to be working as expected, tear down the RAID 1 on the first node and then add those two disks to the Ceph pool.
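
If I understand correctly, that last step should boil down to one pveceph call per freed drive (device names here are examples, and leftover RAID metadata has to be wiped first):

    # after tearing down the RAID 1 on the first node
    ceph-volume lvm zap /dev/sda --destroy   # wipe old RAID/LVM signatures
    pveceph osd create /dev/sda
    pveceph osd create /dev/sdb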

Obviously I'd keep backups of VMs as necessary jusssssst in case things go south.

Totally crazy? Or do I have a chance at being successful? :-D
 
I have not yet implemented Ceph in my production homelab, but I am tinkering with it in a test environment. However, I have rebuilt my lab more than a few times while keeping the minimal required VMs running.

What I have done is take one server from my production setup, reinstall Proxmox from scratch, and create the required networks and local storage on that server to run my VMs. I then stopped the required VMs still on the production servers and did a backup, transferred the backup to the newly rebuilt server, and restored the VM(s). Now I can rebuild (or do whatever needs to be done) on the other servers without affecting the VMs that need to stay running.

Once I am all done, I follow the steps from before in reverse: I do a backup on the lone server, migrate the backup to the new production servers, and restore the VMs. Finally, I add the lone server into the production cluster and rebalance the workload. I am sure you don't need to do this, but I always reinstall Proxmox on that lone server before adding it into the cluster, to make sure it is configured the same as the other servers and is ready to join.
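
A sketch of that shuffle on the command line, with made-up VM ID, storage names, and archive path:

    # on the old host: consistent stop-mode backup of VM 100
    vzdump 100 --mode stop --storage local --compress zstd

    # copy the archive over, then on the new host:
    qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --storage local-lvm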
 
Hmm okay, if you migrate everything to one node first, create a new Ceph cluster with the remaining two nodes, then migrate the VMs over to them, and then finally recreate the single node and join it to the cluster, it should work.

As long as you only have two nodes, there will be no redundancy and Ceph will throw warnings, but it should work.
 
As mentioned, it does work to have ceph on one node, but you would need to tweak the pools and rules in a way that you probably shouldn't if you don't know what you're doing.

Your proposed way is perfectly fine, though as aaron noted, Ceph might still complain. Also, depending on the sizes and number of the disks, you might end up in a situation where the pool is full even though you have plenty of space left on the disks. If the rule wants three copies on three different nodes, and one node is full, it doesn't matter that there's plenty of space on the other nodes. So you might still need to force Ceph a bit in the beginning until you have all the disks available in the Ceph cluster.
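
During that transition it's worth keeping an eye on how full each node actually is, e.g.:

    ceph df           # per-pool usage and MAX AVAIL
    ceph osd df tree  # fill level per OSD and per host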
 
Thanks for the feedback everyone! Sounds like this is totally doable and I'm excited to get started!

When it comes to the Ceph pool, I was planning on configuring it for 2 copies. The idea being that the cluster only needs to survive losing a single node. This is what I do currently with gluster (2 copies plus arbiter) and it has been pretty resilient to the various ways I've brought the rack down (intentionally and, uh, unintentionally, heh). This way the cluster benefits from sharing some of the space of the other members for "moar storage" and there isn't a single point of failure. Of course, this means there will be reads happening across the cluster, but this should be fine as each member has dual 10G links to the network.

The example setup seems to indicate setting the pool to keep 3 copies. Is there anything inherently wrong with the above?
 
oh, @oz1cw7yymn brings up a good point. If your whole cluster consists of only two nodes, but you still have the default size/min_size of 3/2, it will try to create the 3rd replica on the two nodes -> unexpected data usage. So until you have added the third node, make sure that the size is only set to 2.
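
Setting that is a one-liner per pool (vm-pool is a placeholder name):

    ceph osd pool set vm-pool size 2   # during the transition
    ceph osd pool set vm-pool size 3   # once the third node has joined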

Overall, the discussion about the size/min_size is a different one as it applies to normal operation and not the transitioning phase.

Usually, people will then go ahead and set it to 2/1 instead of 3/2 so that the affected pools stay writable.
If you have it set to 2/1, you run into potential problems, mainly if one node fails and Murphy's Law hits you; for example, the other node dies as well, or just one of its OSDs. There are other issues that could come up as well:

With replication 2, you usually have two problems:
  • if you find an inconsistent object, you have a hard time telling which copy is the correct one
  • if you have flapping OSDs, i.e. osd.0 goes down, osd.1 is up and acting and accepts writes. At the next moment, osd.1 goes down and osd.0 comes up. osd.0 becomes primary, but the PG is ‘down’ because osd.1 had the last data. In that case, you need osd.1 to come back so that the PG will work again.
From: https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Or point 9, slide 14: https://www.slideshare.net/ShapeBlue/wido-den-hollander-10-ways-to-break-your-ceph-cluster

TL;DR: If you like your data, you use 3/2
 
This makes sense. I have to think about it a bit differently since the setup would be different than my existing gluster pool.

With my current gluster setup, each host has its own RAID array, so a single disk failure in a given node does not result in any lost redundancy until an entire node disappears. If I do Ceph in the recommended configuration, with the controllers passing the disks directly to the underlying system, a given copy of data will reside on a single disk. So two copies would mean the data is on two disks, and if both disks become unavailable, that'd be Bad News (TM). I see why the minimum recommended replica count is 3.

In my setup, I have 3 servers where one server has 6 disks and the other two servers have 4 disks each. They're all 2 TB drives, so ~28 TB raw space. How does this translate to usage with Ceph? Will the two extra disks in the 6-disk server be of any use? Or will the Ceph pool be effectively ~8 TB? My gut says I should just pull the two extra disks so each server has 4 x 2 TB disks. Or could they be used for other Ceph-related purposes, potentially?

Sorry if my questions are starting to go off-track. If it's better for me to start a new thread I can do that.
 
Yeah - it's really important to understand how ceph is different from other redundancy methods.

Nodes do not need to have the same number of disks or the same disk sizes, but you have to know what happens with the Ceph rules. So, say you have two copies of everything on two different nodes (like @aaron mentioned, not the best idea) and three nodes: each piece of data can spread over any two nodes with space left. So if node 1 has 12 TB with 8 TB left, node 2 is full, and node 3 has 2 TB left, data can still be stored on nodes 1 and 3 until node 3 is full (simplified - you don't want to have your nodes full!).
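
To put rough numbers on the capacity question above (6x2 TB + 4x2 TB + 4x2 TB): with size=3 and failure domain node, every object needs one copy on each of the three nodes, so the pool fills when the smallest node fills:

    node 1: 6 x 2 TB = 12 TB
    node 2: 4 x 2 TB =  8 TB
    node 3: 4 x 2 TB =  8 TB
    -> roughly 8 TB of usable data (one replica per node, before full ratios),
       with ~4 TB on node 1 that a 3-copies-on-3-nodes rule cannot use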

Note that the (replication) rule for a pool has
  • size = the number of copies the system wants
  • min_size = the number of copies the system must have to serve any data
  • failure_domain_type = the type of equipment that two copies may not share. So if your failure domain is node, the (size) copies all need to be on different computers. You can set the failure domain to osd (disk), rack, datacenter, etc., allowing any level of resilience in your Ceph cluster

Basically, a RAID-1 is a rule with size=2, min_size=1 and failure_domain = osd/disk. Ceph is made to be much more resilient than that, and by default expects more equipment, but that's the choice of the system administrator.
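
Putting those three knobs together, creating a rule and a pool might look like this on the CLI; the names and PG count are examples:

    # one copy per host (the usual default behaviour)
    ceph osd crush rule create-replicated rep_host default host

    # RAID-1-like: copies only have to be on different OSDs
    ceph osd crush rule create-replicated rep_osd default osd

    ceph osd pool create vm-pool 128 128 replicated rep_host
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2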
 
Hmm. So, I just acquired two IBM EXP2512 DAS shelves. I'm trying to think of how to integrate these into my setup. Having two shelves, not three, poses a bit of a challenge.

I was reading about how one can build a 2-node Proxmox cluster and have a 3rd system participate as a quorum vote device only. Assuming this is a remotely good idea, could I then take this a step further and do Ceph between the two "real" nodes with replication 4/2 or 4/3? The idea being that there would be two copies of a given block on both DAS shelves (one attached to each "real" node). This way, if one of the real nodes goes down, there is still some redundancy left.
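
As far as I understand, that would need a hand-written CRUSH rule along these lines (a sketch only; the name and id are made up, and it would be edited into a decompiled CRUSH map and paired with a pool at size=4, min_size=2):

    rule two_hosts_two_osds {
        id 10
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type host       # pick both hosts
        step chooseleaf firstn 2 type osd    # then two OSDs on each
        step emit
    }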

Another thing I'm considering trying is connecting a DAS shelf to two hosts. I've read this should be doable, but because the setup is simply an HBA/JBOD type setup, I'm not sure how I'd get data redundancy. The idea being I could use one DAS shelf for VM disks and the other DAS shelf to store backups.

Curious what y'all think?
 
Now you're really taking on a lot of unsupported scenarios. Proxmox can do 2 nodes + quorum device, but that is not a good idea with Ceph. You can do a one-node Ceph (with a lot of tweaking and going outside recommendations) and you can do 3+ nodes (preferably many more; Ceph likes more nodes over bigger nodes) - but a two-node Ceph will require more advanced rules than I could advise on.

Not recommended, here is more info on those rules.

Regarding your DAS enclosures, everything is possible but not recommended. My advice: get another one, or don't use them for Ceph. That saves a lot of unsupported hassle.
 
I mean, someone has to push the envelope...heh.

I am not tied to Ceph specifically. It was my original plan as per my setup, but these disk shelves have thrown me a curve ball, heh. I may look into sourcing a 3rd disk shelf, but that would be a future thing unfortunately. I would like to use them if I can.

One option I was considering is doing a 2-node-plus-quorum cluster and then ZFS with replication, but I'd ideally want any sort of replication to happen in real time so that I can live migrate things around as maintenance is needed.
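
From what I have read, Proxmox's built-in ZFS replication (pvesr) is asynchronous snapshot shipping rather than real time, with a once-per-minute schedule at the tightest. The node name and job ID below are made up:

    # replicate VM 100's disks to node pve2 every 15 minutes
    pvesr create-local-job 100-0 pve2 --schedule "*/15"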

If I forego the ability to live migrate, I could just do a standalone host and have the other host run Proxmox Backup Server. It would leave less flexibility when host maintenance is needed (a host reboot would require downtime), but I'd have some level of data protection if the VE host were to bite the dust.

Another option is to see what I can/cannot do when both hosts are connected to the same DAS shelf.

I'm just brainstorming possible avenues here. Since it's a home lab, I'm ok pushing the envelope a little as long as I'm not completely setting myself up for failure.
 
I run a one-node Ceph cluster for a particular use case myself, so my point about unsupported scenarios is that you'll likely not get a lot of help from people with a lot of experience; you'll have to get that experience yourself :)

I don't know anything about ZFS.

Good luck!
 
That makes complete sense.

I think what I may end up doing is simplifying the setup as follows:
  • One node running Proxmox VE paired with one DAS shelf.
  • One node running Proxmox Backup paired with one DAS shelf.
I actually do not have a great backup strategy at the moment, so I think the above will be a good trade-off. In the future, if I come across more/different hardware that allows me to revisit clustering, I can do that then. I will miss being able to live-migrate VMs around to perform host maintenance, but having the ability to do historical backups (even if they remain on site) will be nice.
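
For reference, hooking the PBS box into the VE host should then be a single storage entry; every value below is a placeholder:

    pvesm add pbs backup01 --server 192.168.1.20 --datastore homelab \
        --username backup@pbs --password <secret> --fingerprint <pbs-fingerprint>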
 
