ceph-msgr stuck at 100% CPU after upgrade to Proxmox 6

ITWarrior

Renowned Member
As the title suggests, I've upgraded two non-Ceph nodes from 5.4 to 6, and after a reboot both have a ceph-msgr kernel worker locked at 100% CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   13 root      20   0       0      0      0 R 100.0  0.0   2:28.44 kworker/0:1+ceph-msgr
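For reference, a rough way to see what a kworker like this is actually spinning on is to sample its kernel stack as root (the PID below is the one from the top output above; repeat a few times to see whether it stays in the same place):

cat /proc/13/stack
# or, if perf is available, sample it live:
perf top -p 13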

While there are other Ceph nodes in this same cluster, neither of these two had Ceph installed before the upgrade. I've since installed Ceph Nautilus (through the UI) on both of them in an attempt to resolve this, but the problem remains.

Any suggestions?
 
Is the network working properly (e.g. no packet loss)? Is the msgr2 protocol activated on the upgraded cluster? And could you please describe your setup further?
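A quick way to check the msgr2 part is to run the following on one of the Ceph nodes; monitors that speak msgr2 list a v2 address on port 3300 (the lines below only show the expected shape, not real values):

ceph mon dump
# expect monitor entries roughly like:
# 0: [v2:<mon-ip>:3300/0,v1:<mon-ip>:6789/0] mon.<name>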
 
It's a seven-node cluster spread out over two buildings. Connectivity is mostly gigabit, with some 10G. Physically, it looks like this:

1G Switch A <-> 1G Switch B <-> 10G Switch C

node01 (192.168.1.201/24) - Switch A
node02 (192.168.1.202/24) - Switch A
node03 (192.168.1.203/24) - Switch B
node04 (192.168.1.204/24) - Switch B
node05 (192.168.1.205/24, 10.0.0.205/24) - Switch C
node06 (192.168.1.206/24, 10.0.0.206/24) - Switch C
node07 (192.168.1.207/24, 10.0.0.207/24) - Switch C

Latency is < 1ms everywhere. No packet loss.
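For anyone wanting to reproduce that check, plain ping between two nodes is enough to show both latency and loss, e.g.:

# 100 pings from node01 to node05, summary only
ping -c 100 -q 192.168.1.205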

Nodes 5/6/7 all have Ceph running (a pretty much stock Proxmox/Ceph setup, replicated 3x). Each has 4 enterprise-class SSDs, 3 of which are for Ceph, so 9 SSDs in Ceph in total. All three nodes were built recently (within the last few months) on the then-current Proxmox/Ceph versions, i.e. 5.4 and Luminous.

The logical network is two /24s: 192.168.1.0/24 for all seven nodes, plus 10.0.0.0/24 for the three Ceph nodes (5/6/7). All Ceph traffic happens on the 10.0.0.0/24 network. Nodes 1/2/3/4 have no 10.0.0.x addresses configured and so can never reach the Ceph network.
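Keeping Ceph on the 10.0.0.0/24 network comes down to the network settings in /etc/pve/ceph.conf on the Ceph nodes, roughly like the sketch below (illustrative, not a verbatim copy of the file):

[global]
     # keep both public and cluster traffic on the 10G subnet
     public_network = 10.0.0.0/24
     cluster_network = 10.0.0.0/24
     # monitors on the 10/24 addresses of nodes 5/6/7
     mon_host = 10.0.0.205 10.0.0.206 10.0.0.207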

All nodes are now running Proxmox 6, apart from node02, which is still on 5.4 and due to be upgraded at the end of this week. I followed the upgrade guides for 5 -> 6 and for Luminous -> Nautilus to the letter, and the upgrades on all nodes went pretty much without a hitch (props to the devs!). Ceph is now using msgr2.
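The msgr2 part comes down to the single command from the Nautilus upgrade guide, run once on a Ceph node after all monitors are on Nautilus:

ceph mon enable-msgr2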

Nodes 1 and 3 have ceph-msgr running at 100%; possibly node 4 as well, but nodes 1 and 4 are spares and were shut down after the upgrade. node03 is currently online, and its ceph-msgr is thrashing the CPU. The network, hardware, VMs, config etc. did not change between 5.4 and 6 or between Luminous and Nautilus (apart from the upgrades themselves). This problem did not occur on 5.4.

I think this might be a Proxmox issue rather than a Ceph one, because Ceph wasn't even installed on node03; I only installed it (via the UI) after I saw the load, thinking it might help. It didn't.

Thanks for looking into this.
 
Could you please post the /etc/pve/ceph.conf and /etc/pve/storage.cfg?
 
I figured this out, albeit somewhat accidentally. The Ceph RBD storages in the Proxmox UI were left unrestricted (no node restriction), meaning that every node was trying to mount them. This wasn't an issue previously, as the nodes that couldn't reach the Ceph network seemed to fail gracefully, but something has apparently changed that sends ceph-msgr into a tailspin when it can't route to the storage it's trying to mount.
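In case it helps anyone else: restricting each RBD storage to the nodes that can actually reach Ceph (via the Nodes selection when editing the storage under Datacenter -> Storage) stops the other nodes from trying to mount it. The CLI equivalent is roughly the following, with the storage ID and pool name as placeholders:

pvesm set <storage-id> --nodes node05,node06,node07

# which ends up in /etc/pve/storage.cfg roughly as:
# rbd: <storage-id>
#         content images
#         monhost 10.0.0.205 10.0.0.206 10.0.0.207
#         nodes node05,node06,node07
#         pool <pool-name>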
 
