Broken Ceph OSD created

Shuazi

New Member
Dec 28, 2023
As the title suggests, I have installed Proxmox VE 8.1 on my server and ran into an issue while configuring Ceph. When creating an OSD, I get the error below. Checking the disk shows that the LVM2_member signature was created successfully, but the OSD creation still fails. Looking through the logs, the volume group cannot be found right after the line 'Volume group "ceph-3624de09-6d03-4264-b5c6-810a7ca11fdc" successfully created' is printed, so it seems PVE cannot see the volume group it just created. Both node servers attached to the disk enclosure can access the same hard drives simultaneously. Since the device is EOL and I do not have a valid service subscription, I cannot get support from Oracle and cannot determine the specific cause.



Current hardware environment:

Oracle ODA X5-2 + disk storage enclosure (HBA connected to both nodes)
CPU: 2x E5-2699 v3 per node (two nodes)
Mem: 256 GB per node


Code:
create OSD on /dev/sdy (bluestore)
wiping block device /dev/sdy
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.509536 s, 412 MB/s
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 4a55bbc6-5e30-4e4d-8a8d-3a3985eaf6cc
Running command: vgcreate --force --yes ceph-3624de09-6d03-4264-b5c6-810a7ca11fdc /dev/sdy
 stdout: Physical volume "/dev/sdy" successfully created.
 stdout: Volume group "ceph-3624de09-6d03-4264-b5c6-810a7ca11fdc" successfully created
--> Was unable to complete a new OSD, will rollback changes
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
 stderr: purged osd.0
--> RuntimeError: Unable to find any LV for zapping OSD: 0
TASK ERROR: command 'ceph-volume lvm create --data /dev/sdy' failed: exit code 1
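For anyone hitting the same thing: right after the failed task it is worth checking whether the volume group from the log is actually visible to LVM on that node, for example like this (the VG name is the one from the task log above):
Code:
pvs
vgs
vgdisplay ceph-3624de09-6d03-4264-b5c6-810a7ca11fdc
# force LVM to re-read the devices in case the metadata is stale
pvscan --cache
vgscan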
 
If you want to use Ceph, you should have at least 3 nodes. Unless I understood something wrong, there are only 2 at the moment?

Ceph expects the disks to be local in each node. Using a shared disk shelf that is accessed by multiple nodes, kind of works against the premise and failure scenarios Ceph was built for.

If you still want to use the disk shelf for testing and playing around, see if you can configure it in a way that each node only sees a subset of the disks, so that the nodes never see the same disks.
 
Hardware: [attached image: 1703758338539.png]
Currently, there are only two nodes. If we don't use Ceph, are there any other highly available solutions? For example, using ZFS to synchronize files?
 
If I understand the hardware, then both nodes connect directly to the disk shelf and see all disks?
The disk shelf has nothing smart in it and is just a JBOD?

Maybe there are even multiple connections from each node to the disk shelf? In that case, each disk would show up at least twice?

If the disk shelf is really just a JBOD, I am not sure how to successfully integrate it with some kind of redundancy for disk failures.

Should the disk shelf be smart and able to do some kind of RAID by itself and then only export volumes to the nodes, you could basically follow the guide for multipathing and iSCSI (in case each disk shows up multiple times).

Then once you see the large disks on each node, the one option that works out of the box is to configure one LVM storage on it and mark it as "shared". This will let Proxmox VE know that each node in the cluster (or the nodes to which this storage is limited) sees the same contents.
It cannot be thin LVM, as that will break if multiple nodes access it.
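For reference, a rough sketch of how that could look on the command line, assuming a multipathed shared disk that shows up as /dev/mapper/mpatha (the device, VG and storage names here are only placeholders):
Code:
# run once on a single node; the VG lives on the shared disk
pvcreate /dev/mapper/mpatha
vgcreate vg_shared /dev/mapper/mpatha
# register it as a cluster-wide, shared (thick) LVM storage
pvesm add lvm shared-lvm --vgname vg_shared --shared 1
The "--shared 1" flag is what tells Proxmox VE that every node sees the same contents, as described above.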

Or, you could use some kind of clustered file system (OCFS2, GFS2, …) and configure a Directory storage pointing to the path where it is mounted. In the latter case, you will also have to mark it as shared and add the following configuration to it, to tell Proxmox VE that it should only start using the storage if there is something mounted at the specified path:
Code:
pvesm set <storage name> --is_mountpoint 1
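Assuming the clustered file system is already mounted at /mnt/clusterfs on every node (path and storage name are just examples), the whole storage definition could also be done in one go:
Code:
pvesm add dir shared-dir --path /mnt/clusterfs --shared 1 --is_mountpoint 1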
 
Yes, the serial number of each disk is displayed twice. And as soon as a disk is formatted on one node, its status also changes on the other node after a reload.
 
multipathd is your friend :) have a look at the forum as to deploying proxmox with multipathed storage. deploying lvm thick is the most supported, but there is at least one gentleman who deployed gfs2 for a truly clustered fs- see https://forum.proxmox.com/threads/p...-lvm-lv-with-msa2040-sas-partial-howto.57536/ if you're brave ;)
 
This seems to go back to the original problem: vgcreate apparently creates the volume group successfully, and the disk is modified accordingly, but when lvcreate is executed the system still reports that the newly created volume group cannot be found, which is one of the reasons the OSD creation fails.
 
So, let's take a few steps back.

1. Cabling: each of your nodes has 2 SAS ports and your storage has 2 controllers/LRCs. You should have each port on each host connected to each LRC.
2. I don't know what kind of HBA is in these nodes, but assuming it's some kind of LSI, you should be seeing each disk drive in the array listed twice in the BIOS. Verify by providing the output of lsblk.
3. Next, make sure you don't have any file system or partition on any of the SAN volumes; vgs will show you if there are any, delete them using vgremove.
4. Install and configure multipath like so:
apt update && apt install multipath-tools && systemctl restart multipathd
5. Verify that the paths are detected and working: multipath -ll
You should see new devices available now that look like /dev/mapper/mpathXX; you can now create volume groups on them.
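A minimal sketch of what that could look like end to end (the multipath config is a minimal generic one; the mpath device and VG names are placeholders, use whatever multipath -ll actually reports on your nodes):
Code:
# minimal /etc/multipath.conf: readable names, auto-detect multipath-capable devices
cat > /etc/multipath.conf <<'EOF'
defaults {
    user_friendly_names yes
    find_multipaths yes
}
EOF
systemctl restart multipathd
# the shared disks should now also appear as /dev/mapper/mpathX devices
multipath -ll
# LVM (and everything on top of it) then goes on the multipath device, not on /dev/sdX
vgcreate vg_shared /dev/mapper/mpatha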

 
Correct me if I am wrong: if each disk is exposed directly, then yes, it is possible to set up multipathing on top of it (redundancy when accessing the disks) and then create an LVM on top, which the PVE cluster can use as shared storage.

BUT there is nothing in the stack protecting against the failure of a single disk. Usually, what you have is not just a simple JBOD but a smarter storage box that can do RAID and volume management and then expose those volumes as block devices to the servers.
That case would be handled the same way, just with fewer "disks" seen by the servers.

In the case that you are really dealing with a simple JBOD that just exposes each individual disk, the only way I currently see is to make sure that each server is only using one half of the disks. Ideally you could configure the disk shelf to only show one half of the disks to each server to drastically reduce the chances of misconfigurations.
If that is not possible, you need to make sure yourself which disks "belong" to which server and then use them basically as local storage. Then you can configure ZFS on top of the mpath devices or whatever you want. That way you also get redundancy against disk failures.
But that sounds very error-prone, as it could quickly happen that Server 1 accesses the disks of Server 2 and thereby corrupts data. All it takes is one moment of fat-fingering a command.
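If you do go down that road, a sketch of what one node could do with "its" half of the disks, assuming they show up as /dev/mapper/mpatha and /dev/mapper/mpathb (pool, storage and node names are placeholders):
Code:
# mirrored pool on this node's two multipath devices
zpool create -o ashift=12 tank mirror /dev/mapper/mpatha /dev/mapper/mpathb
# make the pool available only to this node in Proxmox VE
pvesm add zfspool node1-zfs --pool tank --nodes node1
Restricting the storage with --nodes keeps the other node from using that pool through Proxmox VE, but it does not stop anyone from importing the pool manually on the wrong node.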


Some background info: (thick) LVM can be shared in a PVE cluster because it is clearly defined where each logical volume starts and ends. By locking the storage cluster-wide on each change, we can avoid data corruption. Each LV is also accessed by only a single VM at any time, again to avoid data corruption.

There is nothing that could do that with RAID, which is local to each node.
 
The multipath -ll command did not return any devices, and after restarting the node it broke the normal startup of the machine, which could only be resolved with a hard reset. So can the data sharing problem only be solved with ZFS plus scheduled synchronization? Multipath seems unsuitable for the current hardware.
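If multipath -ll comes back empty, a few generic diagnostics may help narrow down why (this is only a sketch, not specific to this enclosure; /dev/sdy is just an example device):
Code:
# do the shared disks show up twice with the same serial/WWN?
lsblk -o NAME,SIZE,SERIAL,WWN
# what WWID does one of the shared disks report?
/lib/udev/scsi_id --whitelisted --device=/dev/sdy
# dry run with verbose output to see why multipath skips the device
multipath -d -v3 /dev/sdy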
 
Hardware: [attached image: 1703839963938.jpeg]
This hardware is already EOL and I do not have a valid Oracle service subscription, so I cannot check on the support website whether the disk enclosure supports configuring RAID directly.
 
There is nothing that could do that with RAID, which is local to each node.
Well... not nothing. You have a few options, but none of them are particularly well supported, and I would not suggest them for production.

1. Clustered ZFS. Yes, this is possible (https://github.com/ewwhite/zfs-ha/wiki), but it would take some effort to get this shoehorned in alongside Proxmox.
2. Single-set Gluster.
3. Replace the HBAs with LSI Syncro 9268-8e. It would be a bit tricky to find an unmolested pair, since most people just flash IT-mode firmware on those, but it should be doable (not a software solution, just presenting the option).

Lastly: poor man's RAID 1. Create a pair (or pairs) of VGs on physical disks and assign vdisks to your guests; RAID 1 in the guest.
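For the "poor man's RAID 1" option, the mirroring then happens entirely inside the guest, e.g. with mdadm in a Linux VM that received one vdisk from each VG (device names are assumptions):
Code:
# inside the guest: mirror the two virtual disks and put a file system on top
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc
mkfs.ext4 /dev/md0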

The point still stands: this is not an ideal hardware configuration for Proxmox.
 
