ceph issue after upgrade

akanarya

Hi,
I have 3 nodes.
Today I updated one node: pve-kernel 6.4-15 to 6.4-18 and pve-manager 6.4-14 to 6.4-15.
There was no Ceph update at this time.
I checked Ceph to validate the status before updating the other nodes.
Ceph seems to be working, but the updated node's own monitor and manager failed.
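For the check I ran the usual status commands, roughly:

root@vs5:~# ceph -s
root@vs5:~# ceph health detail
root@vs5:~# pveceph status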

Here are some logs:
Jul 01 16:24:39 vs5 ceph-mon[120912]: 2022-07-01T16:24:39.113+0300 7ff3dca6e5c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-vs5' does not exist: have you run 'mkfs'?
Jul 01 17:17:33 vs5 ceph-mgr[159946]: 2022-07-01T17:17:33.363+0300 7f167ed000c0 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-vs5/keyring: (2) No such file or directory
Jul 01 17:17:33 vs5 ceph-mgr[159946]: 2022-07-01T17:17:33.363+0300 7f167ed000c0 -1 AuthRegistry(0x55e39602c140) no keyring found at /var/lib/ceph/mgr/ceph-vs5/keyring, disabling cephx

root@vs5:~# systemctl status ceph-mon*
ceph-mon@vs5.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Fri 2022-07-01 16:24:49 +03; 1h 31min ago
Process: 120912 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id vs5 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 120912 (code=exited, status=1/FAILURE)

Jul 01 16:24:39 vs5 systemd[1]: ceph-mon@vs5.service: Failed with result 'exit-code'.
Jul 01 16:24:49 vs5 systemd[1]: ceph-mon@vs5.service: Service RestartSec=10s expired, scheduling restart.
Jul 01 16:24:49 vs5 systemd[1]: ceph-mon@vs5.service: Scheduled restart job, restart counter is at 5.
Jul 01 16:24:49 vs5 systemd[1]: Stopped Ceph cluster monitor daemon.
Jul 01 16:24:49 vs5 systemd[1]: ceph-mon@vs5.service: Start request repeated too quickly.
Jul 01 16:24:49 vs5 systemd[1]: ceph-mon@vs5.service: Failed with result 'exit-code'.
Jul 01 16:24:49 vs5 systemd[1]: Failed to start Ceph cluster monitor daemon.

● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; vendor preset: enabled)
Active: active since Fri 2022-07-01 11:48:56 +03; 6h ago

Jul 01 11:48:56 vs5 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.


I checked and the following folders are empty on the updated node (vs5):
/var/lib/ceph/mon/
/var/lib/ceph/mgr/
There is no "ceph-vs5" folder inside either of them.

Since they are empty, the PVE manager does not allow me to destroy the monitor of that node so that I could recreate it.
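For what it's worth, the monitor list can still be checked from one of the healthy nodes (vs4 here is just one of my other nodes) with something like:

root@vs4:~# ceph mon dump
root@vs4:~# ceph quorum_status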

I searched for the issue on the forums, but I am confused and don't want to mess anything up.
Any help is appreciated.
Thank you
Ali
 
OK,
Luckily, I found an old backup, copied the "ceph-vs5" folders of "/var/lib/ceph/mon/" & "/var/lib/ceph/mgr/" back to the server, and rebooted it.
After that the manager worked, but the monitor still did not.
Because its folder now exists inside /var/lib/ceph/mon/, I could destroy the monitor and recreate it.
Now the problem is resolved.
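Roughly, the restore and recreate steps were something like this (I did the destroy/recreate through the web UI, so the pveceph commands below are only the CLI equivalents, and /mnt/backup is just a placeholder for wherever the backup was mounted):

root@vs5:~# cp -a /mnt/backup/var/lib/ceph/mon/ceph-vs5 /var/lib/ceph/mon/
root@vs5:~# cp -a /mnt/backup/var/lib/ceph/mgr/ceph-vs5 /var/lib/ceph/mgr/
root@vs5:~# chown -R ceph:ceph /var/lib/ceph/mon/ceph-vs5 /var/lib/ceph/mgr/ceph-vs5
root@vs5:~# reboot
root@vs5:~# pveceph mon destroy vs5
root@vs5:~# pveceph mon create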

But what if I didn't have a backup, how could I have resolved it then?
Any suggestions?
 
