Ceph with multipath

christian.g
Jun 4, 2020
Hi everyone.

I'm struggling with Ceph using multipath and the PVE-provided tools.

My setup is a 3-node cluster, with each node having a dedicated FC storage.
Each server has 2 HBAs for hardware redundancy.
Multipath is configured and running, and the device-mapper block devices are present.

Code:
mpathe (36001b4d0e941057a0000000000000000) dm-4 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:2  sdh  8:112  active ready running
  `- 2:0:0:2  sdv  65:80  active ready running
mpathd (36001b4d0e93620000000000000000000) dm-3 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:11 sdq  65:0   active ready running
  `- 2:0:0:11 sdae 65:224 active ready running
mpathc (36001b4d0e94100700000000000000000) dm-2 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:10 sdp  8:240  active ready running
  `- 2:0:0:10 sdad 65:208 active ready running
mpathb (36001b4d0e940ac000000000000000000) dm-1 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:1  sdg  8:96   active ready running
  `- 2:0:0:1  sdu  65:64  active ready running
mpathn (36001b4d0e94107a30000000000000000) dm-13 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:9  sdo  8:224  active ready running
  `- 2:0:0:9  sdac 65:192 active ready running
mpatha (36001b4d0e94104000000000000000000) dm-0 SEAGATE,ST2400MM0129
size=2.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 0:0:0:0  sdf  8:80   active ready running
  `- 2:0:0:0  sdt  65:48  active ready running
...

So far so good.
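For anyone following along, a quick way to cross-check that the map names, dm-* nodes and member disks line up (my own sketch, not a PVE command; lsblk column support may vary slightly between versions):
Bash:
# multipath maps resolve to dm-* nodes under /dev/mapper
ls -l /dev/mapper/mpath*
# lsblk reports the maps with TYPE "mpath", shown beneath each of their path devices
lsblk -o NAME,KNAME,TYPE,SIZE,WWN
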
The first thing I see is that the web UI doesn't show the device-mapper block devices but only the individual disks, which is quite annoying.
I would expect to see the dm devices, with disks that are part of a multipath map filtered out.
Anyway, I set up the Ceph cluster and am now trying to add the dm devices.
The web UI doesn't let me select any of the dm devices, so no OSD creation is possible.

So I tried `pveceph osd create /dev/dm-0`, but it fails.
Bash:
-> pveceph osd create /dev/dm-0
unable to get device info for '/dev/dm-0'

I read somewhere that this may be due to blacklisting of device paths. Why are they blacklisted and why isn't this configurable?
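For what it's worth, a minimal check (my own sketch, not something pveceph offers) that the dm node itself is a perfectly healthy multipath map and the rejection comes from the tooling's device whitelist rather than from the device:
Bash:
# the dm node exists and is of type "mpath"
lsblk -o NAME,KNAME,TYPE,SIZE /dev/dm-0
# map the dm node back to its multipath alias
cat /sys/block/dm-0/dm/name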

Next I tried the native Ceph tools, but they fail too.
Bash:
-> ceph-volume lvm create --data /dev/dm-0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a3b9ecd5-20bb-4503-ba0b-f751e12e3e88
 stderr: [errno 2] error connecting to the cluster
-->  RuntimeError: Unable to create a new OSD id
Not really helpful. I looked at the paths and, voilà, the keyring doesn't exist.

So I exported it.
Bash:
-> ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

Retry:
Bash:
-> ceph-volume lvm create --data /dev/dm-0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 062eadbd-fa4d-4945-be1c-5c49c85b3208
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
 stderr: 2020-06-04 13:05:27.729 7f6dbc620700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2020-06-04 13:05:27.729 7f6dbc620700 -1 AuthRegistry(0x7f6db40817b8) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: purged osd.0
-->  RuntimeError: Cannot use device (/dev/dm-0). A vg/lv path or an existing device is needed
:mad:

I exported the keyring again, this time to '/etc/pve/priv/ceph.client.bootstrap-osd.keyring', but the problem persists.
Bash:
-> ceph-volume lvm create --data /dev/dm-0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 82fe501a-22de-45df-bbcc-39c0e1704ba1
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
 stderr: purged osd.0
-->  RuntimeError: Cannot use device (/dev/dm-0). A vg/lv path or an existing device is needed
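For reference, the two keyring exports in one place (a sketch; the /etc/pve/priv path is where the [client] keyring setting in the ceph.conf further down in this thread points):
Bash:
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
# ceph.conf sets keyring = /etc/pve/priv/$cluster.$name.keyring for [client],
# hence the second location the error messages mention
ceph auth get client.bootstrap-osd -o /etc/pve/priv/ceph.client.bootstrap-osd.keyring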

OK, so I read about Ceph and multipath: https://docs.ceph.com/docs/mimic/ceph-volume/lvm/prepare/#multipath-support
"Devices that come from multipath are not supported as-is. The tool will refuse to consume a raw multipath device and will report a message like ...
If a multipath device is already a logical volume it should work, given that the LVM configuration is done correctly to avoid issues."

OK, so I need to create the VG manually.

/etc/lvm/lvm.conf has multipath_component_detection enabled, so in theory LVM should be aware of the situation.
Still, the web UI doesn't offer the dm devices in the LVM section :confused:
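A quick way to double-check what LVM actually sees (a sketch; lvmconfig only prints the effective setting, it doesn't prove the filtering works):
Bash:
# effective LVM setting
lvmconfig devices/multipath_component_detection
# once a PV exists on a map, pvs should list /dev/mapper/mpath* (or the dm node),
# never the underlying sd* path devices
pvs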

I created a pv/vg/lv on dm-0

Code:
pvcreate --metadatasize 250k -y -ff /dev/dm-0
vgcreate vg0 /dev/dm-0
lvcreate -n lv0 -l 100%FREE vg0

Finally I can create an OSD.

Code:
-> ceph-volume lvm prepare --data vg0/lv0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -h ceph:ceph /dev/vg0/lv0
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-14
Running command: /usr/bin/ln -s /dev/vg0/lv0 /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
 stderr: got monmap epoch 2
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring --create-keyring --name osd.0 --add-key AQCn29heqI3xFxAA1OH32hEOZeNpzPGDdIdH0A==
 stdout: creating /var/lib/ceph/osd/ceph-0/keyring
 stdout: added entity osd.0 auth(key=AQCn29heqI3xFxAA1OH32hEOZeNpzPGDdIdH0A==)
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff --setuser ceph --setgroup ceph
--> ceph-volume lvm prepare successful for: vg0/lv0

Bash:
ceph-volume lvm list


====== osd.0 =======

  [block]       /dev/vg0/lv0

      block device              /dev/vg0/lv0
      block uuid                45OjiN-ZBFl-Gnci-B7qS-lfDi-fWoZ-c1HdiS
      cephx lockbox secret
      cluster fsid              ed5ba9ab-6eb4-4ad2-aac9-8265e0cf5a02
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff
      osd id                    0
      type                      block
      vdo                       0
      devices                   /dev/mapper/mpatha

No OSD visible in the Web UI. o_O:mad:

I manually activated the OSD.

Bash:
-> ceph-volume lvm activate osd.0
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/vg0/lv0 --path /var/lib/ceph/osd/ceph-0 --no-mon-config
Running command: /usr/bin/ln -snf /dev/vg0/lv0 /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-14
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/systemctl enable ceph-volume@lvm-0-e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff
 stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff.service → /lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@0
 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service → /lib/systemd/system/ceph-osd@.service.
Running command: /usr/bin/systemctl start ceph-osd@0
--> ceph-volume lvm activate successful for osd ID: 0

And finally the OSD is visible in the web UI.

Honestly guys, this is really really really a pain in the ...
Multipath is absolutely common in enterprise infrastructure, and Proxmox should be able to handle this type of setup.

My 2 cents.
 
Honestly guys, this is really really really a pain in the ...
Multipath is absolutely common in enterprise infrastructure, and Proxmox should be able to handle this type of setup.
Mhm... Multipath is not the issue here. You are simply trying to bring two different worlds together. Primarily, Ceph is not intended to run off an FC storage. This also defeats the purpose of running a distributed storage.
 
Not really. The FC storage is directly connected to a single host through its HBAs. Multipath is for hardware redundancy (failing paths to the disks) and Ceph for distributing the data across the directly attached storage of all 3 nodes.
The same would happen with a SAS storage or whatever technology someone prefers.

[attachment: hv_cluster.png]
 
Have fun. And I simply don't recommend this type of setup.
 
Maybe it would make sense to explain why you don't recommend using Ceph with direct-attached storage and multipath.
 
Maybe it would make sense to explain why you don't recommend using Ceph with direct-attached storage and multipath.
On the forum you will find others that tried the FC storage + Ceph approach. They sure can tell you more.

Besides that, you will introduce more complexity and latency. Plus the setup makes a redundant storage redundant again. If something breaks, Ceph might not be able to recover as it usually would. If you can pull it off and it works adequately for you, then I will be the last to stop you. But please be aware that your setup is an edge case and no one might be able to help you when disaster strikes. So, be warned: 'here be dragons'. ;)
 
Plus the setup makes a redundant storage redundant again

This makes me think you assume the FC storage does redundancy on its own by doing RAID. Am I right? That's not the case; the FC storage runs in JBOD mode, so the data redundancy is done by Ceph only.
FC is in this context really just the transmission channel.
The only added redundancy is the second path, by using two HBAs with multipath.
 
This makes me think you assume the FC storage does redundancy on its own by doing RAID. Am I right? That's not the case; the FC storage runs in JBOD mode, so the data redundancy is done by Ceph only.
That was one of my thoughts. But I meant it more generally: in my experience, systems that were targeted at doing their own redundancy will not further a pleasant Ceph experience. To say the least, JBOD != HBA (also speaking from experience). But hey, as I said, I am the last one to stop you. And you have been warned. ;)
 
In general I agree with you, but throwing away FC storage systems which cost a lot is not an option.
To be honest, though, this has nothing to do with the setup pain.
The same problems will arise with any multipath-enabled system, like a simple SAS shelf with two HBAs.
So IMHO we should talk about that.

- Recognize multipath devices, show them in the disks section, filter out member disks from disks section
- Allow multipath devices to be used as OSDs with the pve tools
- Where do the keyring errors come from? Why aren't they already there after Ceph setup and configuration?
 
- Recognize multipath devices, show them in the disks section, filter out member disks from disks section
Multipath is configured through the multipath package on Debian. Once the disks show up correctly, you can use them for local storage.

- Allow multipath devices to be used as OSDs with the pve tools
We also use ceph-volume here. Since it forces you to set them up manually, the same will go for our tooling.

- Where do the keyring errors come from? Why aren't they already there after Ceph setup and configuration?
Bootstrap keys are located under /var/lib/ceph after the MONs have been set up.

2020-06-04 13:05:27.729 7f6dbc620700 -1 AuthRegistry(0x7f6db40817b8) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
What does the ceph.conf look like? And what pveversion -v are you using?
 
Multipath is configured through the multipath package on Debian. Once the disks show up correctly, you can use them for local storage.

Well, that's actually not the case.

Multipath is working fine and configured correctly.

Code:
-> multipath -l
mpathe
mpathd
mpathc
mpathb
mpathn
mpatha
mpathm
mpathl
mpathk
mpathj
mpathi
mpathh
mpathg
mpathf

But the disks section shows the individual disks only:

[attachment: 1591350021399.png — screenshot of the Disks section]
It doesn't matter whether user_friendly_names is used or not.
The individual disks cannot be used as local storage, as they won't show up in the dropdowns.

We also use ceph-volume here. Since it forces you to set them up manually, the same will go for our tooling.
Sure, but wouldn't it make sense to implement this in your tooling?
IMHO the point of having your own tooling is to provide simplified, convenient commands which do some extra work.

Bootstrap keys are located under /var/lib/ceph after the MONs have been set up.
MONs and MGRs were already set up.

What does the ceph.conf look like? And what pveversion -v are you using?

My ceph.conf

Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.0.104.1/24
         fsid = ed5ba9ab-6eb4-4ad2-aac9-8265e0cf5a02
         mon_allow_pool_delete = true
         mon_host = 10.0.103.1 10.0.103.2 10.0.103.3
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 10.0.103.1/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.hvprx01]
         public_addr = 10.0.103.1

[mon.hvprx02]
         public_addr = 10.0.103.2
        
[mon.hvprx03]
         public_addr = 10.0.103.3

Code:
-> pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
I have had a look at the PVE code, more precisely at Diskmanage.pm->get_disks, and now it's clear why multipath devices will never show up: only whitelisted device-name patterns and special types like ZFS, Ceph and LVM are handled, and the individual disks show up because they match the pattern. Hence a custom handler would need to be added.

Perl:
sub get_disks {
    my ($disks, $nosmart) = @_;
    my $disklist = {};

    my $mounted = {};

    my $mounts = PVE::ProcFSTools::parse_proc_mounts();

    foreach my $mount (@$mounts) {
        next if $mount->[0] !~ m|^/dev/|;
        $mounted->{abs_path($mount->[0])} = $mount->[1];
    };

    my $dev_is_mounted = sub {
        my ($dev) = @_;
        return $mounted->{$dev};
    };

    my $parttype_map = get_parttype_info();

    my $journalhash = get_ceph_journals($parttype_map);
    my $ceph_volume_infos = get_ceph_volume_infos();

    my $zfshash = get_zfs_devices($parttype_map);

    my $lvmhash = get_lvm_devices($parttype_map);

    my $disk_regex = ".*";
    if (defined($disks)) {
        if (!ref($disks)) {
            $disks = [ $disks ];
        } elsif (ref($disks) ne 'ARRAY') {
            die "disks is not a string or array reference\n";
        }
        # we get cciss/c0d0 but need cciss!c0d0
        map { s|cciss/|cciss!| } @$disks;

        $disk_regex = "(?:" . join('|', @$disks) . ")";
    }

    dir_glob_foreach('/sys/block', $disk_regex, sub {
        my ($dev) = @_;
        # whitelisting following devices
        # hdX: ide block device
        # sdX: sd block device
        # vdX: virtual block device
        # xvdX: xen virtual block device
        # nvmeXnY: nvme devices
        # cciss!cXnY: cciss devices
        return if $dev !~ m/^(h|s|x?v)d[a-z]+$/ &&
                  $dev !~ m/^nvme\d+n\d+$/ &&
                  $dev !~ m/^cciss\!c\d+d\d+$/;
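To illustrate what such a handler would have to gather, here is a rough shell sketch (mine, not from Diskmanage.pm) that lists each multipath map together with the member disks that would need to be filtered out of the plain disk list:
Bash:
# list dm devices whose target is "multipath", resolve the dm node and show the members
dmsetup ls --target multipath | awk '{print $1}' | while read -r name; do
    dm=$(readlink -f "/dev/mapper/$name")            # e.g. /dev/dm-4
    echo "$name ($dm) members: $(ls "/sys/block/${dm##*/}/slaves")"
done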
 
The individual disks cannot be used as local storage, as they won't show up in the dropdowns.
We filter by WWN and do not allow their use in the GUI.

Sure, but wouldn't it make sense to implement this in your tooling?
IMHO the point of having your own tooling is to provide simplified, convenient commands which do some extra work.
pveceph tooling does some extras. But in this case it is unlikely that there is much support from Ceph upstream. This entails lots of unknowns and does not justify it.

My ceph.conf
Config and version look good. Was there a previous Ceph installation? Do the logs tell more?
 
pveceph tooling does some extras. But in this case it is unlikely that there is much support from Ceph upstream. This entails lots of unknowns and does not justify it.
You're right, there is not much support from Ceph, but it's still possible. Ceph doesn't say "don't do this"; they just say that they don't know the multipath configuration and hence refuse to handle it. But in the end it's just as simple as creating the PV/VG/LV.
The first step would be to identify and list the mpath devices anyway. Then the user could be warned that this is not directly supported by Ceph when he tries to create an OSD from an mpath device.

This is my simple script to do a one-shot creation of OSDs from all mpath devices.

Bash:
for f in $(multipath -l)
do
    pvcreate --metadatasize 250k -y -ff /dev/mapper/$f
    vgcreate vg$f /dev/mapper/$f
    lvcreate -n lv0 -l 100%FREE vg$f
    ceph-volume lvm prepare --data /dev/vg$f/lv0
done

ceph-volume lvm activate --all

This is not a production script but still shows that the needed work for OSD creation is minimal.
I know that there is more behind the scenes to do (backend/frontend).

Config and version look good. Was there a previous Ceph installation? Do the logs tell more?
No, it's a fresh installation.
I will try to have a look at logs as soon as i have some time to do so.
 
I read through all the logs and didn't see anything special. Maybe this is because of the mixed use of PVE and native Ceph tools?
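(For anyone retracing this, these are the places one would typically check — a sketch, assuming the default Ceph log location and the unit names shown in the activate output above.)
Bash:
# ceph-volume keeps its own log
less /var/log/ceph/ceph-volume.log
# journal of the OSD and of the ceph-volume activation unit created earlier
journalctl -u ceph-osd@0
journalctl -u ceph-volume@lvm-0-e1c9a9c5-e9b6-447f-ba6e-3b60a6f8e5ff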
 
In general I agree with you, but throwing away FC storage systems which cost a lot is not an option.
To be honest, though, this has nothing to do with the setup pain.
The same problems will arise with any multipath-enabled system, like a simple SAS shelf with two HBAs.
So IMHO we should talk about that.

- Recognize multipath devices, show them in the disks section, filter out member disks from disks section
- Allow multipath devices to be used as OSDs with the pve tools
- Where do the keyring errors come from? Why aren't they already there after Ceph setup and configuration?

This is almost exactly what I want to do. I have a simple SAS shelf with one HBA. Currently I have one SAS cable from the HBA to one expander in the shelf. I want to connect a second SAS cable from the HBA to the redundant expander in the shelf. This suggests improved bandwidth to the SATA drives (behind SATA/SAS interposers) in the shelf.
 
I just ran into this exact issue setting up a single SAS shelf with a single HBA. I somewhat understand why a complex FC storage network could be detrimental, but SAS shelves seem like the logical way to add the number of drives needed for Ceph to function with enough IO to be useful. The output of multipath -ll shows devices mpatha-mpathz, but the Proxmox GUI displays the individual drives just as christian.g shows above. Never mind Ceph, it seems that Proxmox doesn't handle multipath devices, which I would think should be standard fare for enterprise use.
 
I did go ahead and connect the second SAS cable from the HBA to the redundant expander in the shelf. And then crawled up the learning curve to get multipath working. At that time, I had three 1 TB spinners using a shared 500 GB SSD for their DB/WAL device plus three 500 GB spinners using a shared 250 GB SSD for their DB/WAL device.

After I was satisfied that everything was working, I noticed intermittent read errors were being logged from the 500 GB spinners. So I replaced them with two 1 TB SSDs. I destroyed the OSDs for the 500 GB spinners and created the new SSD OSDs using christian.g's osd.0 example (above) for my VG/LV naming convention.

[attachment: LVM 2020-11-04 125449.png]

[attachment: OSD 2020-11-04 125633.png]

Code:
root@epyc3000:~# multipath -ll
mpathc (350014ee2bcbb5ef9) dm-1 ATA,WDC WD10SPSX-22A
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:1:0  sdb     8:16  active ready running
  `- 8:0:10:0 sdj     8:144 active ready running
mpathb (35002538e30539f4b) dm-8 ATA,Samsung SSD 860
size=466G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:5:0  sdf     8:80  active ready running
  `- 8:0:14:0 sdn     8:208 active ready running
mpatha (35002538e39c3790b) dm-5 ATA,Samsung SSD 860
size=233G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:4:0  sde     8:64  active ready running
  `- 8:0:13:0 sdm     8:192 active ready running
mpathk (35001b448bb49d0aa) dm-0 ATA,WDC  WDBNCE0010P
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:0:0  sda     8:0   active ready running
  `- 8:0:9:0  sdi     8:128 active ready running
mpathj (35001b448bb483ed3) dm-10 ATA,WDC  WDBNCE0010P
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:6:0  sdg     8:96  active ready running
  `- 8:0:8:0  sdh     8:112 active ready running
mpathh (350014ee2bcb565f4) dm-2 ATA,WDC WD10SPSX-22A
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:2:0  sdc     8:32  active ready running
  `- 8:0:11:0 sdk     8:160 active ready running
mpathf (350014ee2675f84c2) dm-4 ATA,WDC WD10SPSX-22A
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 8:0:3:0  sdd     8:48  active ready running
  `- 8:0:12:0 sdl     8:176 active ready running
root@epyc3000:~#

[attachment: Disks 2020-11-04 125821.png]
 
You're right, there is not much support from Ceph, but it's still possible. Ceph doesn't say "don't do this"; they just say that they don't know the multipath configuration and hence refuse to handle it. But in the end it's just as simple as creating the PV/VG/LV.
The first step would be to identify and list the mpath devices anyway. Then the user could be warned that this is not directly supported by Ceph when he tries to create an OSD from an mpath device.

This is my simple script to do a one-shot creation of OSDs from all mpath devices.

Bash:
for f in $(multipath -l)
do
    pvcreate --metadatasize 250k -y -ff /dev/mapper/$f
    vgcreate vg$f /dev/mapper/$f
    lvcreate -n lv0 -l 100%FREE vg$f
    ceph-volume lvm prepare --data /dev/vg$f/lv0
done

ceph-volume lvm activate --all

This is not a production script but still shows that the needed work for OSD creation is minimal.
I know that there is more behind the scenes to do (backend/frontend).


No, it's a fresh installation.
I will try to have a look at logs as soon as i have some time to do so.
The multipath code block in this doesn't work.
I tried fixing it up, but it still doesn't work.
Bash:
for f in $(multipath -l |  grep dm-  | awk '{ print $1 }')
do
    pvcreate --metadatasize 250k -y -ff /dev/mapper/$f-part1
    vgcreate vg$f /dev/mapper/$f-part1
    lvcreate -n lv0 -l 100%FREE vg$f
    ceph-volume lvm prepare --data /dev/vg$f/lv0
done

ceph-volume lvm activate --all

I get
Code:
  Physical volume "/dev/mapper/35000cca03100fe6c" successfully created.
  Volume group "vg35000cca03100fe6c" successfully created
  Logical volume "lv0" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 895b1b2b-2677-4d1b-8274-605c3567ad55
 stderr: 2022-07-28T13:09:06.065+0100 7fd9f8f8e700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
 stderr: 2022-07-28T13:09:06.065+0100 7fd9f8f8e700 -1 AuthRegistry(0x7fd9f405b868) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 AuthRegistry(0x7fd9f405b868) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 AuthRegistry(0x7fd9f4061d40) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2022-07-28T13:09:06.069+0100 7fd9f359e700 -1 AuthRegistry(0x7fd9f359d0d0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: [errno 2] RADOS object not found (error connecting to the cluster)
-->  RuntimeError: Unable to create a new OSD id
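For what it's worth, two separate things go wrong here: `for f in $(multipath -l)` iterates over every word of the output rather than just the map names, and the RADOS error above is the missing bootstrap-osd keyring already discussed earlier in the thread. A rough sketch of a more robust variant (assuming whole-disk maps, i.e. no -part1 suffix, and that the MONs are already up):
Bash:
# make sure the bootstrap key exists, as done earlier in the thread
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

# iterate over multipath map names only
for f in $(dmsetup ls --target multipath | awk '{print $1}')
do
    pvcreate --metadatasize 250k -y -ff /dev/mapper/$f
    vgcreate vg$f /dev/mapper/$f
    lvcreate -n lv0 -l 100%FREE vg$f
    ceph-volume lvm prepare --data /dev/vg$f/lv0
done

ceph-volume lvm activate --all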
 
This is almost exactly what I want to do. I have a simple SAS shelf with one HBA. Currently I have one SAS cable from the HBA to one expander in the shelf. I want to connect a second SAS cable from the HBA to the redundant expander in the shelf. This suggests improved bandwidth to the SATA drives (behind SATA/SAS interposers) in the shelf.
This is what I am also trying to do, but with 2 SAS shelves:
Direct SAS-attached disk shelf to HBA ports 1 & 2
Direct SAS-attached disk shelf to HBA ports 3 & 4

then using them via multipath as OSDs, with the SAS connections giving me 24 Gbit access to each shelf.
 
