Replacing system drive in Ceph node

troycarpenter

Lots of questions now that I've got some decent hardware and am upgrading to 6.0. Per a discussion in another thread, I would like to move the OS of my Ceph nodes from a default LVM-based install on a large SSD (like 2 TB), ideally to a ZFS RAID 1 boot mirror on much smaller SSDs (256 GB). I'm fully expecting that the system will be down while I reinstall the OS on the ZFS drives, but once the system comes back up and I force it back into the cluster, how do I get the OSDs back online? All the LVMs are still there, and the OSDs show in the CRUSH map, but because the OS has been reinstalled, the "new" node doesn't have all the same OSD links and startup files as before.

I'm sure this is similar to a common problem when the OS drive of a Ceph node fails. How do I reinstall and regain use of the OSDs in the system?

In my case, I can install Proxmox on the ZFS RAID in another server, and even copy data from the original system drive to minimize downtime.
 
I'm sure this is similar to a common problem when the OS drive of a Ceph node fails. How do I reinstall and regain use of the OSDs in the system?
The disks are checked by the ceph-osd.target and mounted accordingly. The bigger part is to destroy the MON first and add it back after the re-installation, as the MON DB is on the root disk. As you said, another way would be to just copy over /var/lib/ceph/ and create the systemd links for the services, but I would go with the first approach, as it is less error prone. As a good measure you can set the norecover & norebalance flags, though with only three nodes they have less of an effect, since Ceph can't recover the 3 replicas onto another node anyway while one node is down.
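For reference, a rough sketch of that sequence from the CLI; the pveceph mon subcommands and the node name placeholder are assumptions based on a PVE 6 setup, so check them against your version first:
Code:
# pause recovery/rebalance while the node is reinstalled
ceph osd set norecover
ceph osd set norebalance

# drop the monitor before wiping the OS disk, re-create it after the reinstall
pveceph mon destroy <nodename>
pveceph mon create

# once the node and its OSDs are back up, clear the flags again
ceph osd unset norecover
ceph osd unset norebalance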

EDIT: you can always try the re-installation in VMs. ;)
 
Does the new ceph-volume command help any? It seems to imply that it will find already created OSDs and configure them, but maybe I'm reading too much into that.

Right now the best course of action I've got to prevent downtime for the VMs is to move all VM images back to local storage so I can work on the Ceph nodes without locking up the VMs.
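For the record, something like this is what I have in mind for moving each disk from the CLI; the VM ID, disk name, and target storage below are only examples, adjust them to your setup:
Code:
# move disk scsi0 of VM 100 off the Ceph pool to local storage, deleting the source afterwards
qm move_disk 100 scsi0 local-lvm --delete 1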

Also, forgive my ignorance, but what do you mean by the statement that with only three nodes Ceph can't recover with 3 replicas? Are you just saying that the system won't try to recover when one node is down, since all three need to be up for 3 replicas?
 
Does the new ceph-volume command help any? It seems to imply that it will find already created OSDs and configure them, but maybe I'm reading too much into that.
ceph-volume is the tool to provision disks as OSDs, while the ceph-volume-systemd helper activates the OSDs on service start. The ceph-osd@X services are created dynamically. It was made with the intention of making OSDs portable within the same cluster.
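On a reinstalled node with LVM-based (Nautilus-style) OSDs you can also trigger that activation by hand, assuming ceph.conf and the keyrings are already in place:
Code:
# list the OSDs ceph-volume can find on the local disks
ceph-volume lvm list

# re-create the systemd units and mount/start all discovered OSDs
ceph-volume lvm activate --all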

Also, forgive my ignorance, but what do you mean by the statement that with only three nodes Ceph can't recover with 3 replicas? Are you just saying that the system won't try to recover when one node is down, since all three need to be up for 3 replicas?
Yes, exactly.
 
ceph-volume is the tool to provision disks as OSDs, while the ceph-volume-systemd helper activates the OSDs on service start. The ceph-osd@X services are created dynamically. It was made with the intention of making OSDs portable within the same cluster.
What I'm trying to figure out is whether or not I can re-install Proxmox (and Ceph) on a Ceph node with existing OSDs, force a rejoin to the Proxmox cluster, then have the system recognize, configure, and start the OSDs on that node. I then expect there will be some recovery since the OSDs will have been unavailable for a short time. I am willing to do manual work to help that happen.

My experience with the first node I tried is that once the Ceph node was back in the cluster, it wouldn't start the OSDs because the ceph-osd@X services had not been recreated. I could destroy and re-create an OSD and it would come online, but I couldn't (probably because I didn't know the right CLI incantation) get the system to recognize and configure the existing OSDs.
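I suspect something along these lines is the incantation I was missing, though I haven't verified it; this assumes the OSDs were originally created with ceph-disk (pre-Nautilus), and the device name is only an example:
Code:
# record the metadata of an existing ceph-disk OSD data partition under /etc/ceph/osd/
ceph-volume simple scan /dev/sdb1

# mount, start, and enable all scanned OSDs
ceph-volume simple activate --all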

To minimize downtime for a node, I have an extra compute server that hasn't been configured yet. I can use it to create the new ZFS system disks for the Ceph nodes and get everything to the point where it rejoins the cluster. From there I can shut down the target Ceph node, replace the system drive(s), and boot. At that point I force it to rejoin the cluster, which all works. The last step is to get the existing OSDs back online, and that's the part I'm missing.
 
I still never figured out how to get the newly installed system drive to recognize the OSDs after installation. However, that really wasn't too much of an issue, since I was also going to have to recreate the OSDs anyway to get them onto the new LVM configuration used in Nautilus.

The method I used is to create the ZFS boot drives in another system, using the same system name during the install but a different IP. Once it was up, I fixed the sources.list.d entries, did a dist-upgrade, and installed Ceph. I copied /etc/ssh and /root/.ssh to the new server, as well as the interfaces file. In Proxmox I deleted the mon/mgr/mds from the server to be replaced, then shut down that server. I replaced its system drive with the new ZFS drives and booted. I then forced the new install to join the cluster. From there it was a matter of recreating the OSDs one at a time, at a metered pace.
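Roughly, the commands involved boiled down to something like this; it's only a sketch under my assumptions (the node name and device are placeholders, and I'm assuming the PVE 6 pveceph subcommands):
Code:
# on the freshly installed node: fix /etc/apt/sources.list.d, update, install Ceph
apt update && apt dist-upgrade
pveceph install

# copy /etc/ssh, /root/.ssh and /etc/network/interfaces from the old system drive

# on a remaining cluster node, before the swap: remove the old monitor
pveceph mon destroy <nodename>

# after the drive swap and cluster rejoin: recreate the OSDs one disk at a time
pveceph osd create /dev/sdX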

Let the system rebalance and repeat for the next Ceph server.
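To judge when it's safe to move on, watching the standard status output is enough:
Code:
# overall health and recovery/rebalance progress
ceph -s

# per-OSD utilization, to see the data flowing back onto the new OSDs
ceph osd df tree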
 
This is what I use when reinstalling the OS drive on Ceph nodes and the OSDs are not automatically recognized. It is an OSD-by-OSD process, so take your time to avoid mistakes and losing any OSDs:

1. Reinstall Proxmox as usual.
2. Find out which OSD the drive /dev/sdX belongs to. For example, let's say the first drive /dev/sda is an OSD. Mount the drive on a temporary location and read the content of whoami. This will show you the OSD ID in numeric form:
Code:
$ mkdir -p /mnt/temp-osd
$ mount /dev/sda /mnt/temp-osd
$ cat /mnt/temp-osd/whoami
1
3. Manually create the osd directory:
Code:
$ mkdir /var/lib/ceph/osd/ceph-1
4. Run the following command to mount the OSD data and start the OSD in the foreground:
Code:
$ ceph-osd -f -i 1 --osd-data /var/lib/ceph/osd/ceph-1
5. Stop the OSD from the GUI. It should then auto-start from its new location.

Since all existing OSDs still have your data, you can keep adding them like this until the last drive. The cluster will automatically recognize and rebalance the data if necessary.
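As a quick sanity check after each OSD (using the example OSD ID 1 from above):
Code:
# confirm the service came back and the OSD is up/in
systemctl status ceph-osd@1
ceph osd tree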
 