Can I move a CEPH disk between nodes?

proxwolfe

Active Member
Jun 20, 2020
Hi,

I am running a small three node cluster with PVE and CEPH.

Soon I am going to replace the nodes with new hardware. In this context I am wondering whether it would be possible to move the CEPH disks from a replaced node to a new system.

If not, my guess is that I will need to add a new node with a new CEPH disk, mark one of the old CEPH disks in the node to be replaced as out, let everything rebalance so the data is copied to the new disk, and then remove the old CEPH disk. And all of that for each node to be replaced, while the data before and after will be the same. That just sounds like a lot of copying and time that I would like to save.

So can that be done?

Thanks!
 
Or, on a related note:

What is the best practice to replace a node in PVE and CEPH clusters?

If I first add a node, my cluster has an even number of nodes, which might prevent it from having quorum. The same would be true if I first removed one of the nodes.

So what is the right way to do it?

Thanks!
 
Can you tolerate downtime? It would be best to back up the VMs (with PBS, preferably) and data and re-install PVE with Ceph Quincy.
 
If downtime is okay, then use backup/restore as @jdancer recommended.

If you want to have no downtime, then it will be a bit more tedious.
If you have completely new hardware, add it to the cluster. Create new MONs, MGRs and so on. Then remove these services from the old nodes one by one so that only the new nodes run these services.

Regarding the OSDs: if you have new OSD disks in the new hardware, it is just a matter of marking the OSDs in the old nodes as out, but let them run. Ceph will then migrate the data. Once the cluster is Health_OK again and the old OSDs have zero usage, you can destroy them.
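The "mark out, wait, destroy" sequence above could be sketched roughly like this. It is only a sketch under assumptions: the `run` and `drain_osd` helpers and the `DRY_RUN` and `POLL_SECONDS` variables are made up for illustration, and it assumes `ceph osd safe-to-destroy` and `pveceph osd destroy` are available on your version. By default it only prints the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical helper: print commands by default, execute only with DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Drain one old OSD and destroy it once Ceph says it is safe.
drain_osd() {
    local id=$1
    # Mark the OSD out but leave the daemon running so it can hand off its data.
    run ceph osd out "$id"
    # Poll until Ceph reports that no data depends on this OSD anymore.
    until run ceph osd safe-to-destroy "osd.$id"; do
        sleep "${POLL_SECONDS:-60}"
    done
    # Stop the daemon first, then remove the OSD via the Proxmox VE tooling.
    run systemctl stop "ceph-osd@$id.service"
    run pveceph osd destroy "$id"
}
```

With the defaults, `drain_osd 3` just lists the four commands it would run for osd.3, so you can review them before setting `DRY_RUN=0`.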

If you want to reuse the OSD disks in the new nodes, you could destroy the OSDs in one node, move them to the new node and recreate them. Wait until Ceph is okay again and then do it again for the next node. You will run with reduced redundancy for that time, but functionality should not be impacted.

Then, once no Ceph services are running on the old nodes, remove them from the Proxmox VE cluster.
Check the CRUSH map afterwards (<node> -> Ceph -> Configuration) to see if there are leftover buckets of the old nodes. If there are, you can remove them with ceph osd crush remove {bucket-name}. See the Ceph docs.

If I first add a node, my cluster has an even number of nodes, which might prevent it from having quorum. The same would be true if I first removed one of the nodes.

So what is the right way to do it?
As long as all nodes are up, you will have a quorum. The problem with an even number of nodes arises if the network splits in such a way that you end up in a "split brain" situation where each side only has 50% of the votes. Neither side has a majority then, so neither side is quorate.
To reduce the impact of such a problem during the migration, you could disable the HA stack by removing the guests from HA or setting them to "ignored".
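To make the vote arithmetic concrete, here is a small sketch. It assumes every node has exactly one vote, and the function names `quorum_needed` and `has_quorum` are made up for this example; they are not part of any Proxmox VE tool.

```shell
#!/bin/bash
# Votes needed for quorum: a strict majority of all configured votes.
quorum_needed() {
    local total=$1
    echo $(( total / 2 + 1 ))
}

# Exit status 0 if a partition holding $1 of $2 votes still has quorum.
has_quorum() {
    local present=$1 total=$2
    [ "$present" -ge "$(quorum_needed "$total")" ]
}
```

For a 4-node cluster `quorum_needed 4` gives 3, so `has_quorum 4 4` succeeds while an even 2/2 split (`has_quorum 2 4`) fails on both sides. With 3 nodes the threshold is 2, so a single node dropping out is tolerated.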
 
As a side issue - I'm using a 3-node cluster with a 4-bay eSATA chassis on each node - 12 total drives. A chassis dies and I physically move the drives to another node. The GUI sees the drives on the new node as LVM2_member. OSD In or Start doesn't work because the OSDs are on the old node in Out status. Ceph is not rebalancing because it's expecting the drives to come back. Is a Destroy the only option? I was hoping for something like an Import or it recognizing the drives as OSD members.
 
You have to make sure the nodes and drives are IN THE EXACT POSITION they were in the original chassis- otherwise you'll end up with the wrong node accessing the drives.
Along the same lines, I have a similar issue:

I want to replace the Proxmox boot drive with a larger one. It looks like the easiest solution is to:
(a) migrate VMs/CTs to different nodes
(b) shut down machine and update HW
(c) re-install Proxmox on the new boot drive
(d) remove "old" instance of node from proxmox
(e) re-add "new" node to cluster

Now, the OSDs have not moved at all; however, /var/lib/ceph/osd does not contain entries for them. Is the best solution to remove and re-add the OSDs (even though the data is already there), therefore causing a lengthy migration, or is there a way to re-create the /var/lib/ceph/osd directories? Can I just back up the contents of the directories from the "old" drive instance and restore them in the "new" one?

Thanks!
 
You have to make sure the nodes and drives are IN THE EXACT POSITION they were in the original chassis- otherwise you'll end up with the wrong node accessing the drives.
Yes and no. You can mix the order of the discs in the chassis and it will find them. You can even change the chassis if you need to; for instance, I was having problems with my eSATA chassis and put in a USB 3 chassis. Upon reboot it picks up all four ceph disks. The problem is when you are playing around with lots of hardware and accidentally put in an old disk you did not destroy properly. All of a sudden it has two osd.2's and you get a truckload of red errors. If you're quick enough to realize what's happening you can pull it; otherwise you'll have some invalid blocks that have to shake out with deep scrub eventually, if they ever do. Thankfully it's a lab and no real data was lost, but I do enjoy the self-healing nature of the system, where it migrates everything around.
 
If I were doing this I would use something like clonezilla to clone the boot drive to another drive and expand the space. I'm not entirely sure GParted can do this.
 
Yeah, I thought about doing this, but I am not sure how to deal with the tmeta partitions... If you have any thoughts, I'd be glad to hear them..
 
I don't. I would be somewhat nervous myself and would probably kick on a Clonezilla or macrium image to have a get-out-of-jail-free card. In fact, if you have an external USB chassis, you can make a Macrium boot ISO from their Free product and see what it picks up. It's pretty good about cloning. Any additional nodes I put on will be processing nodes only. If I had to add local storage I would probably just add the drive then format it singly as ZFS then hit it with 'VM Repository' flag. Then you can Migrate into that storage. I would have to look that one up.
 
I want to replace the Proxmox boot drive with a larger one.
Offline, this is pretty simple to do: clone the boot disk to a new (bigger) disk, reboot and resize as appropriate; there's no need to reinstall or change anything in the cluster settings. It can be done with any Linux live CD. (Just noticed @FarVision already suggested this: use clonezilla. Do not attempt this on a mounted root filesystem.)

Yes and no. You can mix the order of the discs in the chassis and it will find them.
Comment was specific to the 4 node-in-a-chassis config the OP was referring to. You are right as long as the right disks are presented to the right node.

I was having problems with my eSATA chassis and put in a USB 3 chassis. Upon reboot it picks up all four ceph disks.
I... don't have to tell you that's a bad way of doing ceph...
 
clonezilla is probably what I will use, since I already have it on our PXE server and we've been using it for cloning and backing up physical machines for quite some time. I am just not sure how to resize the tmeta partitions that Proxmox uses..
 
It was an interesting learning experience! Ceph actually stalled out the bus and offlined individual disks to keep the system running. The logs were pretty cool. The four drives were placed into an internal SATA bay. The eSATA HBA was choking and offlining the drive bay for some reason. Little Rosewill thing died for the cause. Ceph signatures all picked up OK since I'm doing everything on each drive. External drive chassis mothballed for something else

Also.. usb2 in a pinch will -never- catch up :)

Resurrecting old hardware and seeing what happens. A usb2 chassis works ok for a PBS ZFS RAIDz though.
 
Turns out clonezilla is not the solution. It gets confused with the tmeta partitions and croaks, even in "dd" mode.

What I ended up doing is the following:
- make sure there are no mgr/mds/mon daemons and no VMs/CTs on the node
- tar up /var/lib/ceph on the old drive and store it somewhere on the network
- replace boot drive and re-install from scratch
- remove old node from cluster
- add "new" machine on cluster
- restore tar file from above on /var/lib/ceph
- run the following script that restores a symlink for ceph.conf and sets up systemd symlinks to set things up correctly for OSDs:

Code:
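#!/bin/bash -x

# restore the /etc/ceph/ceph.conf symlink to the cluster-wide config
cd /etc/ceph || exit 1
rm -f ceph.conf
ln -s /etc/pve/ceph.conf .

# re-create the ceph-volume systemd symlinks, one per restored OSD directory
cd /etc/systemd/system/multi-user.target.wants || exit 1
for dir in /var/lib/ceph/osd/ceph-*; do
    [ -d "$dir" ] || continue
    # OSD id is the directory name without the "ceph-" prefix
    name=${dir##*/ceph-}
    fsid=$(cat "$dir/fsid")
    ln -s /lib/systemd/system/ceph-volume@.service "ceph-volume@lvm-$name-$fsid.service"
done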

and finally reboot.


This works like a charm.
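The "tar up" and "restore" steps from the list above might look roughly like this. It is a sketch, not the exact commands used in this thread: the function names and the optional second argument (an alternate root directory, there only so the functions can be tried on a scratch tree) are assumptions for illustration.

```shell
#!/bin/bash
# Save var/lib/ceph (relative to the given root, default /) into an archive.
backup_ceph_state() {
    local archive=$1 root=${2:-/}
    # -C keeps the stored paths relative (var/lib/ceph/...).
    tar -C "$root" -czf "$archive" var/lib/ceph
}

# Unpack the archive again under the given root (default /).
restore_ceph_state() {
    local archive=$1 root=${2:-/}
    # -p keeps permissions; run as root so file ownership is restored as well.
    tar -C "$root" -xpzf "$archive"
}
```

On the old install that would be something like `backup_ceph_state /mnt/somewhere/ceph-state.tgz` (the path is only a placeholder), and `restore_ceph_state` with the same archive after the node has been reinstalled and re-joined to the cluster.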
 
That's interesting. You didn't have any inconsistencies in your pgs? I'm a big ceph fan but I can't recall the last time I had a green status icon. It's always something.
 
No inconsistencies.. There is some initial delay bringing the OSDs up until they catch up, but that's the same you would get if a node is down for some time and the OSDs have to catch up. It takes me less

I know exactly what you mean about green status. There's always something...

It may be possible to manually create the OSD entries under /var/lib/ceph/osd from information readily available from other places, and not have to tar up all of /var/lib/ceph, but this works, and I'm happy about this.

The script does 2 things:
1. re-create the /etc/ceph/ceph.conf symlink that points to /etc/pve/ceph.conf
2. re-create symlinks in /etc/systemd. This is a weird one. Without those systemd startup files, the /dev/dm-* devices that correspond to the OSDs don't have the right ownership (they would be owned by root instead of the ceph user). If you follow the /var/lib/ceph/osd/ceph-*/block symlinks you'll end up at those /dev/dm-* files, and without the right permissions the OSDs won't come up.
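If you want to verify the ownership point after the reboot, a quick check along these lines could help. This is a sketch: the function name `check_osd_block_owner` is made up, it assumes GNU `stat` and `readlink`, and both arguments (OSD directory root and expected owner) are optional and default to /var/lib/ceph/osd and ceph.

```shell
#!/bin/bash
# Report every OSD whose block device is not owned by the expected user.
# Returns 0 if all are fine, 1 if at least one mismatch was found.
check_osd_block_owner() {
    local osd_root=${1:-/var/lib/ceph/osd} want=${2:-ceph}
    local dir dev owner bad=0
    for dir in "$osd_root"/ceph-*; do
        [ -L "$dir/block" ] || continue
        # Follow the symlink chain to the real device node, e.g. /dev/dm-3.
        dev=$(readlink -f "$dir/block")
        owner=$(stat -c %U "$dev") || continue
        if [ "$owner" != "$want" ]; then
            echo "${dir##*/}: $dev owned by $owner, expected $want"
            bad=1
        fi
    done
    return "$bad"
}
```

Running `check_osd_block_owner` with no arguments prints nothing when every block device belongs to ceph; any line it prints names an OSD that will likely fail to start.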
 
That doesn't sound bad, but I like to Encrypt my OSDs, so I can't imagine this not wrecking everything. I do like the normal method of removing drives, as I like to upgrade as time goes on. Glad it worked for you!
 
The way I see it, if that method does not work, what's the alternative? Erase the drives and let ceph do its thing. It's only a win to try it out.
 
Stealing some of the pieces from another thread - just went through this myself, figured I'd share what worked 100% - about 20 drives completed so far, zero issues.


IMPORTANT: this assumes DB/WAL are on a single physical drive. If they aren't, you'll have to consolidate them down first, then move it.
(( ceph-volume lvm migrate --osd-id <ID> --osd-fsid <FSID> --from db wal --target <FULL VG PATH> ))


ORIGIN SERVER:

1. Find the values you need to proceed:
   FSID: cat /var/lib/ceph/osd/ceph-<ID>/fsid
   VG-ID: ls -l /var/lib/ceph/osd/ceph-<ID>/block | cut -f2 -d">" | cut -f3 -d"/"

2. ceph osd out <ID>
3. systemctl stop ceph-osd@<ID>.service
4. ceph-volume lvm deactivate <ID> <FSID>
5. vgchange -a n <VG-ID>
6. vgexport <VG-ID>


Remove the disk from the server and insert it into the other server.


DESTINATION SERVER:
1. pvscan
2. vgimport <VG-ID>
3. vgchange -a y <VG-ID>
4. ceph-volume lvm activate <ID> <FSID>
5. ceph osd in <ID>
6. ceph osd crush set <ID> <WEIGHT/SIZE> host=<NEWHOST>
7. systemctl status ceph-osd@<ID>.service




credit : https://forum.proxmox.com/threads/osd-move-issue.56932/post-263918
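For the origin-server part of the steps above, the lookup of FSID and VG-ID can be wrapped in a small helper that only prints the commands for review. This is a sketch under assumptions: the function names are made up, and it assumes the block symlink points at /dev/<VG>/<LV>, which is what the `cut` pipeline in step 1 relies on too (encrypted OSDs may look different).

```shell
#!/bin/bash
# Read the OSD fsid from its data directory.
osd_fsid() {
    cat "$1/fsid"
}

# Derive the LVM volume group name from the block symlink,
# e.g. /dev/ceph-1234/osd-block-abcd -> ceph-1234
osd_vg_name() {
    local target
    target=$(readlink "$1/block") || return 1
    target=${target#/dev/}
    echo "${target%%/*}"
}

# Print (not run) the origin-side commands for one OSD id.
# The second argument overrides the OSD directory; it is optional.
print_origin_steps() {
    local id=$1 dir=${2:-/var/lib/ceph/osd/ceph-$1} fsid vg
    fsid=$(osd_fsid "$dir") || return 1
    vg=$(osd_vg_name "$dir") || return 1
    cat <<EOF
ceph osd out $id
systemctl stop ceph-osd@$id.service
ceph-volume lvm deactivate $id $fsid
vgchange -a n $vg
vgexport $vg
EOF
}
```

`print_origin_steps 7` would list the five origin commands for osd.7 with the fsid and VG name filled in; the same two values are then needed for vgimport and ceph-volume lvm activate on the destination.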
 