PVE Ceph - Upgrade current OSDs with larger ones, 8 OSDs (Filestore) share a single journal NVMe

ekin06

Member
Jul 15, 2014
Hello,

I want to enlarge a Ceph cluster that uses the filestore backend (I want to keep it, it has been running fine for about 5 years). We have 4 Ceph nodes, each with 8 OSDs of 3 TB. The cluster sometimes fills up to 85% (WARNING) and I have to manually intervene and free some storage space.

Each OSD in a node uses a 20 GB journal partition on a 200 GB Intel Optane DC P4801X NVMe. I now have 24 x 6 TB WD Red HDDs here, which will replace the 3 TB HDDs. I just want to make sure I am doing all the steps required for the replacement correctly.
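
(For checking how full the pools and the individual OSDs are, the standard Ceph commands should be enough; nothing cluster-specific is assumed here.)
Code:
# overall and per-pool usage
ceph df
# per-OSD usage, arranged by the CRUSH tree
ceph osd df tree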

Edit:
Code:
# ceph -v
ceph version 14.2.22 (877fa256043e4743620f4677e72dee5e738d1226) nautilus (stable)

# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.157-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-11
pve-kernel-helper: 6.4-11
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.22-pve1
ceph-fuse: 14.2.22-pve1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1


Steps to replace OSDs one by one, node by node:

1. Down OSD [N] via GUI
2. Out OSD [N] via GUI
3. Destroy OSD [N] via GUI
4. Replace HDD in Server and check for correct detection
5. Now on the console, create the new OSD with the former journal partition (check for the correct journal partition: blkid, cat /var/lib/ceph/osd/ceph-[N]/journal_uuid; see the sketch after this list)
-> ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/nvme0n1p[Y]
6. In OSD [N] via GUI
7. Start OSD [N] via GUI
8. Wait until the cluster is healthy
9. Repeat from step 1 with the next OSD
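
As a minimal sketch of the check in step 5 (the OSD number and device names are only placeholders, the same ones as above):
Code:
# before destroying OSD [N]: note which journal partition it currently uses
cat /var/lib/ceph/osd/ceph-[N]/journal_uuid
readlink -f /var/lib/ceph/osd/ceph-[N]/journal
# compare the UUID against the NVMe partitions
blkid /dev/nvme0n1p*
# after swapping the HDD: create the new OSD on that same journal partition
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/nvme0n1p[Y]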

Am I doing this right? Am I missing something? I would appreciate it if someone could confirm my approach.

Thank you and have a nice week everyone !
Kind Regards
ekin06
 
Thanks.

When I set up the cluster, filestore was a bit more performant, so we kept it (in the beginning we only had a Samsung 950 EVO as the journal device, which I replaced earlier this year with the Intel drives).

And as far as I remember, bluestore doesn't benefit that much from a separate DB/WAL device or partition?
But I could think about switching to bluestore with the new Intel drives if I got that point wrong.
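
If I did switch, the create call for a bluestore OSD with the DB on the NVMe should look roughly like this, as far as I understand (untested on our cluster, the partition names are again only placeholders):
Code:
# BlueStore OSD with the DB (and implicitly the WAL) on the NVMe partition
ceph-volume lvm create --bluestore --data /dev/sd[X] --block.db /dev/nvme0n1p[Y]
# or via the PVE tooling
pveceph osd create /dev/sd[X] -db_dev /dev/nvme0n1p[Y]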
 
OK, today I am replacing the first disk. So far everything went fine; I am just waiting for the backfill. I decided to stay with filestore because I am not sure about the outcome of a switch. I also had some trouble identifying the correct HDD because of those s*** workstation cages without hotswap (unfortunately those 4 servers are not 19''). Steps 6 and 7 were not needed because the new OSD was already in and running after creation.
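
What helped with matching a /dev/sdX name to a physical drive was reading out model and serial number first (a small sketch with /dev/sde as an example, the disk from the output below):
Code:
# print model and serial so the drive can be matched against its physical label
smartctl -i /dev/sde | grep -E 'Model|Serial'
# the persistent by-id names also contain model and serial
ls -l /dev/disk/by-id/ | grep -w sde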

1. Down OSD [N] via GUI -> OK (I waited for a healthy cluster, to be safe)
2. Out OSD [N] via GUI -> OK
3. Destroy OSD [N] via GUI
-> not so OK, because with "cleanup disks" checked it will also wipe and destroy the journal partition. So you will have to run fdisk before step 5 and recreate the partition with type 74 (Ceph journal); see the sketch below the output.
4. Replace HDD in Server and check for correct detection
5. Create the OSD with "ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/nvme0n1p[Y]" -> went through with some errors:
Code:
root@pve04:~# ceph-volume lvm create --filestore --data /dev/sde --journal /dev/nvme0n1p5
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 38d8f5b7-3c33-4b7f-a904-a398a5cde239
Running command: /sbin/vgcreate --force --yes ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf /dev/sde
stdout: Physical volume "/dev/sde" successfully created.
stdout: Volume group "ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf" successfully created
Running command: /sbin/lvcreate --yes -l 1430791 -n osd-data-38d8f5b7-3c33-4b7f-a904-a398a5cde239 ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf
stdout: Logical volume "osd-data-38d8f5b7-3c33-4b7f-a904-a398a5cde239" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /sbin/mkfs -t xfs -f -i size=2048 /dev/ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf/osd-data-38d8f5b7-3c33-4b7f-a904-a398a5cde239
stdout: meta-data=/dev/ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf/osd-data-38d8f5b7-3c33-4b7f-a904-a398a5cde239 isize=2048 agcount=6, agsize=268435455 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0
data = bsize=4096 blocks=1465129984, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Running command: /bin/mount -t xfs -o rw,noatime,inode64 /dev/ceph-f882ab05-dd25-4697-ba17-620bfd4c31cf/osd-data-38d8f5b7-3c33-4b7f-a904-a398a5cde239 /var/lib/ceph/osd/ceph-28
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p5
Running command: /bin/ln -s /dev/nvme0n1p5 /var/lib/ceph/osd/ceph-28/journal
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-28/activate.monmap
stderr: 2022-06-09 17:50:36.869 7f9aa2a17700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2022-06-09 17:50:36.869 7f9aa2a17700 -1 AuthRegistry(0x7f9a9c041248) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
stderr: got monmap epoch 8
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-28/journal
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p5
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-28/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore filestore --mkfs -i 28 --monmap /var/lib/ceph/osd/ceph-28/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-28/ --osd-journal /var/lib/ceph/osd/ceph-28/journal --osd-uuid 38d8f5b7-3c33-4b7f-a904-a398a5cde239 --setuser ceph --setgroup ceph
stderr: 2022-06-09 17:50:37.129 7fbd1f73cc80 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-28//keyring: (2) No such file or directory
stderr: 2022-06-09 17:50:37.129 7fbd1f73cc80 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-28//keyring: (2) No such file or directory
stderr: 2022-06-09 17:50:37.129 7fbd1f73cc80 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-28//keyring: (2) No such file or directory
stderr: 2022-06-09 17:50:37.225 7fbd1f73cc80 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected 38d8f5b7-3c33-4b7f-a904-a398a5cde239, invalid (someone else's?) journal
stderr: 2022-06-09 17:50:37.269 7fbd1f73cc80 -1 journal do_read_entry(4096): bad header magic
stderr: 2022-06-09 17:50:37.269 7fbd1f73cc80 -1 journal do_read_entry(4096): bad header magic
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-28/keyring --create-keyring --name osd.28 --add-key AQC/FqJiLNXVEhAAHsJyyL64m5FN6g6r2zsN3g==
stdout: creating /var/lib/ceph/osd/ceph-28/keyring
added entity osd.28 auth(key=AQC/FqJiLNXVEhAAHsJyyL64m5FN6g6r2zsN3g==)
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-28/keyring
--> ceph-volume lvm prepare successful for: /dev/sde
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-28
Running command: /bin/ln -snf /dev/nvme0n1p5 /var/lib/ceph/osd/ceph-28/journal
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p5
Running command: /bin/systemctl enable ceph-volume@lvm-28-38d8f5b7-3c33-4b7f-a904-a398a5cde239
stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-28-38d8f5b7-3c33-4b7f-a904-a398a5cde239.service → /lib/systemd/system/ceph-volume@.service.
Running command: /bin/systemctl enable --runtime ceph-osd@28
Running command: /bin/systemctl start ceph-osd@28
--> ceph-volume lvm activate successful for osd ID: 28
--> ceph-volume lvm create successful for: /dev/sde
For now I am going to ignore it and wait for the full backfill. Is it critical in any way? The cluster is healthy, and the new OSD appears to be OK in the GUI and is getting filled (~450 MB/s). I will restart the server tomorrow and see if things are fine. If all is cool... the next OSD will be swapped.
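
Regarding the wiped journal partition from the step 3 note: a minimal sketch of recreating it with sgdisk instead of fdisk (partition number and size are from my layout; the GUID is the GPT partition type for a Ceph journal):
Code:
# recreate a 20G journal partition (here as partition 5) with the Ceph journal GPT type
sgdisk --new=5:0:+20G --typecode=5:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/nvme0n1
# reread the partition table
partprobe /dev/nvme0n1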

8. Wait until the cluster is healthy -> the cluster is healthy already, but I am waiting for the backfill to complete (dunno if this is even necessary?).

UPDATE (06/14/22):
OK, my current routine is going slowly but well. Darn, this will take a month until all disks are replaced.
I switched step 1 (Down) with step 2 (Out), because this is what the PVE Admin Guide says:
To replace a functioning disk from the GUI, go through the steps in Destroy OSDs. The only addition is to wait until the cluster shows HEALTH_OK before stopping the OSD to destroy it.
On the command line, use the following commands:
ceph osd out osd.<id>

So the routine is now:

1. Out OSD [N] via GUI (wait HEALTH_OK)
2. Down OSD [N] via GUI (wait HEALTH_OK -> esp. red PGs need to disappear before cluster can go green)
3. Destroy OSD [N] via GUI (NO disk cleanup, to prevent the separate journal partition from getting destroyed -> no hassle with recreation)
4. Replace HDD in Server (check detection, restart server on problems)
5. Create OSD "ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/nvme0n1p[Y]"
6. Next OSD

I don't know yet how important it is to wait for the backfill to complete before destroying the OSD. But I think that if the cluster is OK, it is safe to destroy it before the backfill has finished. That would speed things up a bit (a possible check is sketched below).
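
As far as I understand, Ceph itself can answer that: since Luminous there is a safe-to-destroy check that errors out as long as the data from that OSD has not been fully recovered elsewhere (osd.<id> is a placeholder, as above):
Code:
# errors out as long as destroying the OSD would reduce data durability
ceph osd safe-to-destroy osd.<id>
# same idea before stopping an OSD
ceph osd ok-to-stop osd.<id>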
 
