Ceph + secure communications + TPM disk ⇒ scary looking kernel error, 'no match of type 1 in addrvec', 'corrupt full osdmap', even when krbd not set

dwm · Jul 20, 2023
[ This follows on from my previous comment on a different thread, https://forum.proxmox.com/threads/pverados-segfault.130628/post-574807 ]

I've just figured out the whats and whys of a problem I've been having trying to create a new VM that uses RBD disks hosted by an external Ceph cluster. I'm mainly writing this down to document the issue for anyone else running into the same problem, though the Proxmox VE developers might conceivably consider tweaking their code to support the more security-conscious!

The short version is:
• The virtual TPM used by Proxmox VE doesn't know how to talk to a Ceph RBD natively, so relies on having a kernel block device.
• Proxmox VE spots this, and so dutifully tries to use rbd map to prepare a suitable block device for it to use.
• In this scenario, I'm using an external Ceph cluster that insists on using msgr2 with full encryption ('secure') for all inter- and intra-cluster traffic.
• However, for this to work the rbd map command must be explicitly passed the --options ms_mode=secure flag to direct the kernel to use secure communications (see the example after this list); 6.2-series kernels, at least, don't appear to be able to figure this out for themselves. If the option isn't specified, the mapping fails to connect and dumps some worrying-looking error messages and an osdmap hexdump into the kernel log.
• Setting all the normal ceph.conf options to try to force the use of secure communications doesn't appear to help because the kernel doesn't read those settings, and the rbd map tool doesn't appear to pass on the necessary flag to the kernel.
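
To make that concrete, here's roughly what the difference looks like on the command line. The pool, image, and client names below are placeholders, not anything Proxmox VE actually generates; the important part is the ms_mode map option, which recent kernels (5.11 onwards, from memory) understand:
Code:
# Roughly what happens today: krbd defaults to the legacy msgr1 protocol, so this
# fails against a cluster that insists on msgr2 'secure' for everything.
rbd map mypool/vm-100-disk-1 --id pveclient

# What actually works: tell the kernel client to use msgr2 with full encryption.
rbd map mypool/vm-100-disk-1 --id pveclient --options ms_mode=secure

(rbd device map is the same command under its newer spelling, and rbd unmap undoes the mapping, if you want to experiment by hand.)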

The possible fixes to this problem are:
• Don't request a soft-TPM if you don't need one, thus avoiding any RBD mapping via krbd;
• Allow your Proxmox VE systems to talk to your Ceph cluster without insisting on secure communications (probably unwise!);
• The excellent Proxmox VE developers (hi!) add a configuration hook somewhere in the storage configuration to make it possible to specify that secure communications are required, and adjust the rbd map command-line invocation for you accordingly;
• Someone teaches the Ceph rbd tool to pass on the necessary ms_mode flag to the kernel if the client configuration indicates it should (see the ceph.conf sketch after this list);
• Someone fixes up the kernel to auto-negotiate a secure connection as necessary (ideally, defaulting to it?)
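
For reference, the 'client configuration' and 'normal ceph.conf options' I keep mentioning are the msgr2 connection-mode settings, along the lines of the snippet below. Userspace clients (librados/librbd) honour these, but the kernel client never reads ceph.conf, which is why they make no difference to rbd map:
Code:
[global]
    # Require msgr2 full encryption for daemons and userspace clients.
    # krbd never sees any of this.
    ms_cluster_mode = secure
    ms_service_mode = secure
    ms_client_mode = secure
    ms_mon_cluster_mode = secure
    ms_mon_service_mode = secure
    ms_mon_client_mode = secure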

For the benefit of future searchers, here are some snippets from the kernel log:
Code:
Jul 20 16:23:47 FQDN kernel: libceph: mon2 (1)10.64.97.3:6789 session established
Jul 20 16:23:47 FQDN kernel: libceph: no match of type 1 in addrvec
Jul 20 16:23:47 FQDN kernel: libceph: corrupt full osdmap (-2) epoch 56000 off 5854 (0000000074d29902 of 0000000012b77a71-00000000d6ab7fb1)
(lengthy memory dump)

One of the hints that the wrong thing is happening in the above is the port number: :6789 is the legacy msgr1 port, while msgr2, which is what secure mode runs over, normally listens on :3300.
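
If you want to check what your monitors are advertising, ceph mon dump shows the address vectors. The output below is an illustrative sketch rather than a dump from my real cluster, but the shape is right: v2 entries are msgr2 on :3300, v1 entries are the legacy protocol on :6789, and as far as I can tell the 'type 1' in the error above is the kernel looking for a legacy v1 entry in an address vector and not finding one.
Code:
$ ceph mon dump
epoch 4
fsid 12345678-aaaa-bbbb-cccc-1234567890ab
min_mon_release 17 (quincy)
0: [v2:10.64.97.1:3300/0,v1:10.64.97.1:6789/0] mon.a
1: [v2:10.64.97.2:3300/0,v1:10.64.97.2:6789/0] mon.b
2: [v2:10.64.97.3:3300/0,v1:10.64.97.3:6789/0] mon.c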

Anyway, I hope this helps someone!

Best wishes,
David
 
