No MDS service

permport

New Member
Sep 11, 2024
Thank you for accepting me to this board.

We need to upgrade our Proxmox VE, but as a prerequisite we first need to upgrade our Ceph.

I'm following this guide to upgrade our Proxmox VE 6.4 Ceph Nautilus to Octopus: https://ainoniwa.net/pelican/2021-08-11a.html (I know, it's in Japanese, but luckily there is Google Translate).

All went well until I came to the point where I have to "Upgrade all CephFS MDS daemons" (this is also the name of the chapter on that site), but there are no MDS services when I do a ceph status:

[screenshots: ceph status output with no mds entry under services]

What could be wrong, and how can I solve this?

I've been searching for hours, but I can't find a solution.

It's an environment that I have inherited from an engineer who has left the building. No documentation available.

Thank you for any help.
 

Hi,
what does ceph fs ls show? Are there any cephfs storages defined in /etc/pve/storage.cfg? If you do not have any CephFS storage, but just RBD, then there is no need for metadata servers and you can skip that step. Please also note that there is an official guide: https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus
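As a quick sanity check, all of these should come back empty or show no daemons if you only use RBD (the grep pattern is just an assumption about how a CephFS entry would appear in storage.cfg):

ceph fs ls                                  # lists CephFS file systems, if any
ceph mds stat                               # shows the MDS daemon status
grep -A3 '^cephfs:' /etc/pve/storage.cfg    # any CephFS storage definitions?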

I'm assuming you are in the process of upgrading to a current Proxmox VE version; otherwise, please note that Proxmox VE 6.x has been end-of-life for two years:
https://pve.proxmox.com/wiki/FAQ
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
https://pve.proxmox.com/wiki/Upgrade_from_7_to_8
 
Thank you for your reply!

The output is:

root@pmnode1:~# ceph fs ls
No filesystems enabled

The content of /etc/pve/storage.cfg:

dir: local
    path /var/lib/vz
    content images,iso,backup
    prune-backups keep-last=2
    shared 0

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

nfs: synnfs
    export /volume2/nfsdumps/pmcluster
    path /mnt/pve/synnfs
    server xxx.xxx.xxx.50
    content vztmpl,iso,images,snippets,backup
    prune-backups keep-daily=3,keep-last=2

rbd: ceph-vm
    content images
    krbd 0
    pool ceph-vm

rbd: ceph-ct
    content rootdir
    krbd 0
    pool ceph-ct


Yes, indeed, we want to upgrade to a newer Proxmox VE, but during the process we noticed that we must upgrade Ceph first.
 
So I continued; now I only have to unset noout.
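For reference, these are the noout commands from the official guide:

ceph osd set noout     # set before the upgrade so Ceph does not rebalance while OSDs restart
ceph osd unset noout   # unset once all OSDs run the new version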

The health status is not so good:

root@pmnode1:~# ceph status
  cluster:
    id:     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    health: HEALTH_ERR
            5 scrub errors
            Possible data damage: 3 pgs inconsistent
            2 pools have too many placement groups

  services:
    mon: 3 daemons, quorum pmnode1,pmnode2,pmnode3 (age 20h)
    mgr: pmnode2(active, since 20h), standbys: pmnode1
    osd: 15 osds: 12 up (since 22m), 12 in (since 20h)

  data:
    pools:   3 pools, 513 pgs
    objects: 398.46k objects, 1.5 TiB
    usage:   4.5 TiB used, 83 TiB / 87 TiB avail
    pgs:     510 active+clean
             3   active+clean+inconsistent

  io:
    client: 19 MiB/s rd, 282 KiB/s wr, 4 op/s rd, 39 op/s wr


root@pmnode3:~# ceph health detail
HEALTH_ERR 5 scrub errors; Possible data damage: 3 pgs inconsistent; 2 pools have too many placement groups
[ERR] OSD_SCRUB_ERRORS: 5 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 3 pgs inconsistent
    pg 2.2e is active+clean+inconsistent, acting [14,3,26]
    pg 2.9e is active+clean+inconsistent, acting [2,15,26]
    pg 2.e1 is active+clean+inconsistent, acting [2,26,13]
[WRN] POOL_TOO_MANY_PGS: 2 pools have too many placement groups
    Pool ceph-vm has 256 placement groups, should have 32
    Pool ceph-ct has 256 placement groups, should have 32
 
The upgrade will not run properly because you have defective disks and already-inconsistent data in 3 placement groups.
If you can identify which virtual disks use the defective PGs and delete the virtual disks that reference them, you can get Ceph healthy again.
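If you first want to see what is actually damaged, a rough sequence would look like this (using pg 2.2e from your output as an example; only run the repair once a failing disk has been ruled out or replaced):

rados list-inconsistent-obj 2.2e --format=json-pretty   # show the inconsistent objects/shards
ceph pg map 2.2e                                        # show which OSDs hold this PG
ceph pg repair 2.2e                                     # trigger a repair of the PG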

My recommendation is to make backups of all VMs and LXC containers first.
Troubleshooting is possible, but it is very time-consuming and requires a lot of know-how, so I would rather rebuild the cluster and restore the VMs and containers.
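For example (assuming the synnfs storage from your config has enough free space; run it on each node, since vzdump backs up the guests of the node it runs on):

vzdump --all --storage synnfs --mode snapshot   # back up every VM and container on this node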
 
