Proxmox GUI freaks out after adding ceph storage

brucexx

Renowned Member
I have a 5-node PVE cluster (ver 6.4-1, running kernel 5.4.124-1-pve) with an existing 4-node Ceph cluster (Ceph ver 14.2.6, Nautilus) installed under Proxmox - working great for at least 700 days.

Today I added a secondary 4-node Ceph cluster, running Ceph ver 16.2.7 (Pacific) under Proxmox. This cluster was working in the lab and had been added as storage to at least two independent nodes there - working great.

When I added the new storage I lost most of the GUI on node 1 - I could not list storage, look at VM settings, or open a console; the only part that was still working was the summary page. All the VMs on this node remained operational. On the other 4 nodes the only GUI element that was not working was the VM console. The cluster had quorum. I ended up removing the rbd pool via CLI with the pvesm remove cephhdd-pool-2 command, and I did not have to use --force. After removing it the cluster was operational, but the GUI was still broken.
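For reference, this only needs to be done on one node, since /etc/pve/storage.cfg is shared cluster-wide via pmxcfs - roughly (the grep is just my sanity check that the entry was gone):

# remove the storage entry cluster-wide
pvesm remove cephhdd-pool-2

# should print nothing once the entry is gone from the shared config
grep -A 5 cephhdd-pool-2 /etc/pve/storage.cfg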

Here is the CLI output from node 1:

root@pve01:/etc/pve/priv/ceph# systemctl status pvedaemon.service
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-08-31 05:38:12 EDT; 5 months 15 days ago
Process: 2290 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
Main PID: 2318 (pvedaemon)
Tasks: 46 (limit: 13516)
Memory: 281.9M
CGroup: /system.slice/pvedaemon.service
├─ 2318 pvedaemon
├─11319 pvedaemon worker
├─11506 /usr/bin/rbd -p cephhdd-pool-2 -m 10.221.1.170,10.221.1.171,10.221.1.172,10.221.1.173 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/cephhdd-pool-2.keyring ls -l --format json
├─11536 /usr/bin/rbd -p cephhdd-pool-2 -m 10.221.1.170,10.221.1.171,10.221.1.172,10.221.1.173 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/cephhdd-pool-2.keyring ls -l --format json
├─11977 /usr/bin/rbd -p cephhdd-pool-2 -m 10.221.1.170,10.221.1.171,10.221.1.172,10.221.1.173 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/cephhdd-pool-2.keyring ls -l --format json
├─63788 pvedaemon worker
└─88845 pvedaemon worker

Feb 14 19:49:38 pve01 pvedaemon[88845]: rados_connect failed - Operation not supported
Feb 14 19:49:40 pve01 pvedaemon[88845]: rados_connect failed - Operation not supported
Feb 14 19:49:42 pve01 pvedaemon[88845]: rados_connect failed - Operation not supported
Feb 14 19:49:48 pve01 pvedaemon[11319]: rados_connect failed - Operation not supported
Feb 14 19:49:50 pve01 pvedaemon[11319]: rados_connect failed - Operation not supported
Feb 14 19:49:52 pve01 pvedaemon[11319]: rados_connect failed - Operation not supported
Feb 14 19:49:54 pve01 pvedaemon[11319]: rados_connect failed - Operation not supported
Feb 14 19:51:03 pve01 pvedaemon[63788]: <root@pam> successful auth for user 'root@pam'
Feb 14 19:51:42 pve01 pvedaemon[88845]: got timeout
Feb 14 19:51:48 pve01 pvedaemon[88845]: got timeout


Here is pveproxy:
root@pve01:/etc/pve/priv/ceph# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-08-31 05:38:13 EDT; 5 months 15 days ago
Process: 2322 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=1/FAILURE)
Process: 2326 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Process: 66247 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
Main PID: 2328 (pveproxy)
Tasks: 4 (limit: 13516)
Memory: 322.9M
CGroup: /system.slice/pveproxy.service
├─ 2328 pveproxy
├─31961 pveproxy worker
├─39707 pveproxy worker
└─42223 pveproxy worker

Feb 14 20:38:56 pve01 pveproxy[31961]: proxy detected vanished client connection
Feb 14 20:39:30 pve01 pveproxy[39707]: proxy detected vanished client connection
Feb 14 20:39:34 pve01 pveproxy[42223]: proxy detected vanished client connection
Feb 14 20:43:31 pve01 pveproxy[42223]: proxy detected vanished client connection
Feb 14 20:43:39 pve01 pveproxy[31961]: proxy detected vanished client connection
Feb 14 20:44:12 pve01 pveproxy[31961]: proxy detected vanished client connection
Feb 14 20:44:33 pve01 pveproxy[42223]: proxy detected vanished client connection
Feb 14 20:45:23 pve01 pveproxy[31961]: proxy detected vanished client connection
Feb 14 20:45:31 pve01 pveproxy[39707]: proxy detected vanished client connection
Feb 14 20:45:53 pve01 pveproxy[42223]: proxy detected vanished client connection

Here is the pvestatd status:
root@pve01:/etc/pve/priv/ceph# systemctl status pvestatd.service
● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-08-31 05:38:11 EDT; 5 months 15 days ago
Process: 2271 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 2298 (pvestatd)
Tasks: 1 (limit: 13516)
Memory: 224.5M
CGroup: /system.slice/pvestatd.service
└─2298 pvestatd

Feb 14 20:29:09 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:19 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:29 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:39 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:50 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:59 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:30:10 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:30:19 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:30:30 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:30:39 pve01 pvestatd[2298]: rados_connect failed - Operation not supported

At 20:30 I executed pvesm remove cephhdd-pool-2.


The fix was easy and quickly restored the GUI on all nodes:

service pveproxy restart
service pvedaemon restart
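pvestatd was logging the same rados_connect failures, so restarting it as well probably would not have hurt:

service pvestatd restart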



...but I am struggling to understand why adding Ceph storage would cause that. Some type of version incompatibility? Also, the newly added storage was not working; it was listed with a question mark and was timing out even on the other 4 nodes.

Thank you for any suggestions.
 
Feb 14 20:29:09 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Feb 14 20:29:19 pve01 pvestatd[2298]: rados_connect failed - Operation not supported
Is this the one node in that 5-node cluster which is not part of the Ceph cluster? If so, does it have the Ceph repository configured? If not, it is possible that the client is too old. Configuring the Ceph repository and updating packages could show a newer version for the Ceph client.
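On a PVE 6.4 (Debian Buster) node talking to a Nautilus cluster, that would look roughly like this (adjust the release names to your setup):

# add the Ceph repository matching the cluster release
echo "deb http://download.proxmox.com/debian/ceph-nautilus buster main" > /etc/apt/sources.list.d/ceph.list
apt update
# check whether a newer Ceph client is now offered
apt list --upgradable | grep -i ceph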

My suspicion, if that is the case, is that the global_id_reclaim settings might be the cause for the connect failures due to a client which is too old.
https://pve.proxmox.com/wiki/Ceph_N..._the_.60insecure_global_id_reclaim.60_Warning
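If that is the cause, you can check the setting on the new (Pacific) cluster, and as a stopgap re-allow old clients until they are updated. This re-opens the issue the setting protects against, so treat it as temporary only:

# run on the new Ceph cluster, not on the PVE client nodes
ceph config get mon auth_allow_insecure_global_id_reclaim
# temporary workaround only - permit clients without the global_id fix
ceph config set mon auth_allow_insecure_global_id_reclaim true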
 
Thank you for the prompt response.

In the 5-node cluster that was affected, none of the nodes have Ceph installed. They are just connecting to the existing/older Ceph cluster that consists of 4 nodes. They are using ceph-fuse client version 12.2.11+dfsg1-2.1+b1, and the new cluster on 16.2.7 says the minimum required release is Luminous, which 12.2 is.
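For reference, the installed client version on a PVE node can be checked with something like:

# list the Ceph client packages and versions (works even without a configured cluster)
dpkg -l | grep -E 'ceph|librados|librbd'
ceph --version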

I updated this 5-node cluster to 6.4.x about 6 months ago, from 6.1 I think - shouldn't that have updated the client as well, or did I mess something up?

The more pressing question: should I just update the client in this case? I have seen no updates for the client (I have a community subscription). Or should I install Ceph on these 5 nodes without configuring it, just to get the new client?

In both cases, how disruptive would that be to the 5-node PVE 6.4-1 cluster and the existing/older Ceph cluster (Nautilus) connected to it? Should I move all the VMs from that older cluster to local drives first before updating the client (I have about 230 VMs on that cluster)?
 
