No MDS service

permport

Sep 11, 2024
Thank you for accepting me to this board.

We need to upgrade our Proxmox, but as a requirement we first need to update our Ceph.

I'm following this guide to upgrade our Proxmox VE 6.4 Ceph Nautilus to Octopus: https://ainoniwa.net/pelican/2021-08-11a.html (I know, it's in Japanese, but luckily there is Google Translate).

All went well until I came to the point where I have to "Upgrade all CephFS MDS daemons" (this is also the name of the chapter on that site), but there are no MDS services when I do a ceph status:

[screenshots: ceph status output with no MDS services listed]

What can be wrong? How do I solve this?

I've been looking for hours and I can't find the solution.

It's an environment that I inherited from an engineer who has left the building. No documentation available.

Thank you for any help.
 

Hi,
what does ceph fs ls show? Are there any CephFS storages defined in /etc/pve/storage.cfg? If you do not have any CephFS storage, but just RBD, then there is no need for metadata servers and you can skip that step. Please also note that there is an official guide: https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus
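A quick way to verify (a sketch; run as root on any cluster node):

ceph fs ls                                    # lists CephFS filesystems, if any
ceph mds stat                                 # shows MDS daemons, if any
grep -B1 -A5 '^cephfs' /etc/pve/storage.cfg   # CephFS storage entries on the PVE side

If the first command prints "No filesystems enabled" and the grep matches nothing, the MDS chapter does not apply to your cluster.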

I'm assuming you are in the process of upgrading to a current Proxmox VE version, but otherwise, please note that Proxmox VE 6.x has been end-of-life for two years:
https://pve.proxmox.com/wiki/FAQ
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
https://pve.proxmox.com/wiki/Upgrade_from_7_to_8
 
Thank you for your reply!

The output is:

root@pmnode1:~# ceph fs ls
No filesystems enabled

The content of /etc/pve/storage.cfg:

dir: local
        path /var/lib/vz
        content images,iso,backup
        prune-backups keep-last=2
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

nfs: synnfs
        export /volume2/nfsdumps/pmcluster
        path /mnt/pve/synnfs
        server xxx.xxx.xxx.50
        content vztmpl,iso,images,snippets,backup
        prune-backups keep-daily=3,keep-last=2

rbd: ceph-vm
        content images
        krbd 0
        pool ceph-vm

rbd: ceph-ct
        content rootdir
        krbd 0
        pool ceph-ct


Yes, indeed, we want to upgrade to a newer Proxmox VE. But during the process, we noticed that we must upgrade Ceph first.
 
So I continued. Now I just have to unset noout.
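For reference, clearing the flag is a single command on any node:

ceph osd unset noout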

The health status is not so good:

root@pmnode1:~# ceph status
cluster:
id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
health: HEALTH_ERR
5 scrub errors
Possible data damage: 3 pgs inconsistent
2 pools have too many placement groups

services:
mon: 3 daemons, quorum pmnode1,pmnode2,pmnode3 (age 20h)
mgr: pmnode2(active, since 20h), standbys: pmnode1
osd: 15 osds: 12 up (since 22m), 12 in (since 20h)

data:
pools: 3 pools, 513 pgs
objects: 398.46k objects, 1.5 TiB
usage: 4.5 TiB used, 83 TiB / 87 TiB avail
pgs: 510 active+clean
3 active+clean+inconsistent

io:
client: 19 MiB/s rd, 282 KiB/s wr, 4 op/s rd, 39 op/s wr


root@pmnode3:~# ceph health detail
HEALTH_ERR 5 scrub errors; Possible data damage: 3 pgs inconsistent; 2 pools have too many placement groups
[ERR] OSD_SCRUB_ERRORS: 5 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 3 pgs inconsistent
pg 2.2e is active+clean+inconsistent, acting [14,3,26]
pg 2.9e is active+clean+inconsistent, acting [2,15,26]
pg 2.e1 is active+clean+inconsistent, acting [2,26,13]
[WRN] POOL_TOO_MANY_PGS: 2 pools have too many placement groups
Pool ceph-vm has 256 placement groups, should have 32
Pool ceph-ct has 256 placement groups, should have 32
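The scrub errors can be inspected per PG before deciding on a repair, and since Nautilus pg_num can also be reduced in place; a sketch using the PG IDs and pool names from the output above:

rados list-inconsistent-obj 2.2e --format=json-pretty   # show which object copies disagree
ceph pg repair 2.2e                   # use with care; see the reply below about defective disks
ceph osd pool set ceph-vm pg_num 32   # addresses POOL_TOO_MANY_PGS (repeat for ceph-ct)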
 
The upgrade will not run cleanly because you have defective disks and already inconsistent data in 3 placement groups.
If you can determine which virtual disks use the defective PGs and then delete the virtual disks that reference these PGs, you can get Ceph healthy again.
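A sketch of how the affected disk images can be identified (the image name below is hypothetical; confirm the pool name first, since the damaged PGs all start with pool ID 2):

ceph osd lspools        # confirm which pool has ID 2 (ceph-vm is assumed below)
rados --pgid 2.2e ls    # list the objects stored in one inconsistent PG
# RBD objects are named rbd_data.<image-id>.<offset>; the disk image whose
# block_name_prefix contains <image-id> is the one touching the damaged PG:
rbd info ceph-vm/vm-100-disk-0 | grep block_name_prefix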

My recommendation is to make backups of all VMs and LXC containers first.
Since troubleshooting is possible but very time-consuming and requires a lot of know-how, I would rather rebuild the cluster and restore the VMs and containers.
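A minimal sketch of such a full backup from the CLI, assuming the synnfs NFS storage from the storage.cfg above as the target:

vzdump --all --storage synnfs --mode snapshot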
 
