Proxmox 6 - Ceph MDS stuck on creating

Interesting errors:

Code:
-- Logs begin at Sat 2019-08-31 17:13:20 PDT, end at Sat 2019-08-31 18:13:38 PDT. --
Aug 31 17:13:24 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 17:13:24 cl-01 ceph-mds[1221]: starting mds.cl-01 at
Aug 31 17:14:59 cl-01 systemd[1]: Stopping Ceph metadata server daemon...
Aug 31 17:14:59 cl-01 ceph-mds[1221]: 2019-08-31 17:14:59.308 7feaabff0700 -1 received  signal: Terminated from /sbin/init nofb  (PID: 1) UID: 0
Aug 31 17:14:59 cl-01 ceph-mds[1221]: 2019-08-31 17:14:59.308 7feaabff0700 -1 mds.cl-01 *** got signal Terminated ***
Aug 31 17:15:00 cl-01 ceph-mds[1221]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7feaaf241340 time 2019-08-31 17:15:00.966856
Aug 31 17:15:00 cl-01 ceph-mds[1221]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7feab0ae65c6]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  2: (()+0x28079e) [0x7feab0ae679e]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  3: (()+0x3dcfb3) [0x55d7907dffb3]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  4: (()+0x39d8c) [0x7feaaf776d8c]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  5: (()+0x39eba) [0x7feaaf776eba]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  6: (__libc_start_main()+0xf2) [0x7feaaf7610a2]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  7: (_start()+0x2a) [0x55d79053f8ca]
Aug 31 17:15:00 cl-01 ceph-mds[1221]: *** Caught signal (Segmentation fault) **
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  in thread 7feaaf241340 thread_name:ceph-mds
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  1: (()+0x12730) [0x7feaafeaf730]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  2: (__pthread_mutex_lock()+0) [0x7feaafea76c0]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7feab0e590b1]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7feab0ae66c0]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  5: (()+0x28079e) [0x7feab0ae679e]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  6: (()+0x3dcfb3) [0x55d7907dffb3]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  7: (()+0x39d8c) [0x7feaaf776d8c]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  8: (()+0x39eba) [0x7feaaf776eba]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  9: (__libc_start_main()+0xf2) [0x7feaaf7610a2]
Aug 31 17:15:00 cl-01 ceph-mds[1221]:  10: (_start()+0x2a) [0x55d79053f8ca]
Aug 31 17:15:00 cl-01 systemd[1]: ceph-mds@cl-01.service: Main process exited, code=killed, status=11/SEGV
Aug 31 17:15:00 cl-01 systemd[1]: ceph-mds@cl-01.service: Failed with result 'signal'.
Aug 31 17:15:00 cl-01 systemd[1]: Stopped Ceph metadata server daemon.
Aug 31 17:16:18 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 17:16:18 cl-01 ceph-mds[3972]: starting mds.cl-01 at
Aug 31 18:01:23 cl-01 systemd[1]: Stopping Ceph metadata server daemon...
Aug 31 18:01:23 cl-01 ceph-mds[3972]: 2019-08-31 18:01:23.847 7fdb81da6700 -1 received  signal: Terminated from /sbin/init nofb  (PID: 1) UID: 0
Aug 31 18:01:23 cl-01 ceph-mds[3972]: 2019-08-31 18:01:23.847 7fdb81da6700 -1 mds.cl-01 *** got signal Terminated ***
Aug 31 18:01:26 cl-01 ceph-mds[3972]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7fdb84ff7340 time 2019-08-31 18:01:26.522080
Aug 31 18:01:26 cl-01 ceph-mds[3972]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fdb8689c5c6]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  2: (()+0x28079e) [0x7fdb8689c79e]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  3: (()+0x3dcfb3) [0x557cfc66bfb3]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  4: (()+0x39d8c) [0x7fdb8552cd8c]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  5: (()+0x39eba) [0x7fdb8552ceba]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  6: (__libc_start_main()+0xf2) [0x7fdb855170a2]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  7: (_start()+0x2a) [0x557cfc3cb8ca]
Aug 31 18:01:26 cl-01 ceph-mds[3972]: *** Caught signal (Segmentation fault) **
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  in thread 7fdb84ff7340 thread_name:ceph-mds
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  1: (()+0x12730) [0x7fdb85c65730]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  2: (__pthread_mutex_lock()+0) [0x7fdb85c5d6c0]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7fdb86c0f0b1]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7fdb8689c6c0]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  5: (()+0x28079e) [0x7fdb8689c79e]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  6: (()+0x3dcfb3) [0x557cfc66bfb3]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  7: (()+0x39d8c) [0x7fdb8552cd8c]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  8: (()+0x39eba) [0x7fdb8552ceba]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  9: (__libc_start_main()+0xf2) [0x7fdb855170a2]
Aug 31 18:01:26 cl-01 ceph-mds[3972]:  10: (_start()+0x2a) [0x557cfc3cb8ca]
Aug 31 18:01:26 cl-01 systemd[1]: ceph-mds@cl-01.service: Main process exited, code=killed, status=11/SEGV
Aug 31 18:01:26 cl-01 systemd[1]: ceph-mds@cl-01.service: Failed with result 'signal'.
Aug 31 18:01:26 cl-01 systemd[1]: Stopped Ceph metadata server daemon.
Aug 31 18:10:15 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 18:10:15 cl-01 ceph-mds[17838]: starting mds.cl-01 at
Aug 31 18:10:38 cl-01 systemd[1]: Stopping Ceph metadata server daemon...
Aug 31 18:10:38 cl-01 ceph-mds[17838]: 2019-08-31 18:10:38.908 7f2ec5ded700 -1 received  signal: Terminated from /sbin/init nofb  (PID: 1) UID: 0
Aug 31 18:10:38 cl-01 ceph-mds[17838]: 2019-08-31 18:10:38.908 7f2ec5ded700 -1 mds.cl-01 *** got signal Terminated ***
Aug 31 18:10:39 cl-01 ceph-mds[17838]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f2ec903e340 time 2019-08-31 18:10:39.383807
Aug 31 18:10:39 cl-01 ceph-mds[17838]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f2eca8e35c6]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  2: (()+0x28079e) [0x7f2eca8e379e]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  3: (()+0x3dcfb3) [0x557766e45fb3]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  4: (()+0x39d8c) [0x7f2ec9573d8c]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  5: (()+0x39eba) [0x7f2ec9573eba]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  6: (__libc_start_main()+0xf2) [0x7f2ec955e0a2]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  7: (_start()+0x2a) [0x557766ba58ca]
Aug 31 18:10:39 cl-01 ceph-mds[17838]: *** Caught signal (Segmentation fault) **
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  in thread 7f2ec903e340 thread_name:ceph-mds
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  1: (()+0x12730) [0x7f2ec9cac730]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  2: (__pthread_mutex_lock()+0) [0x7f2ec9ca46c0]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7f2ecac560b1]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7f2eca8e36c0]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  5: (()+0x28079e) [0x7f2eca8e379e]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  6: (()+0x3dcfb3) [0x557766e45fb3]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  7: (()+0x39d8c) [0x7f2ec9573d8c]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  8: (()+0x39eba) [0x7f2ec9573eba]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  9: (__libc_start_main()+0xf2) [0x7f2ec955e0a2]
Aug 31 18:10:39 cl-01 ceph-mds[17838]:  10: (_start()+0x2a) [0x557766ba58ca]
Aug 31 18:10:39 cl-01 systemd[1]: ceph-mds@cl-01.service: Main process exited, code=killed, status=11/SEGV
Aug 31 18:10:39 cl-01 systemd[1]: ceph-mds@cl-01.service: Failed with result 'signal'.
Aug 31 18:10:39 cl-01 systemd[1]: Stopped Ceph metadata server daemon.
Aug 31 18:11:39 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 18:11:39 cl-01 ceph-mds[20064]: starting mds.cl-01 at
 
I removed nofb from GRUB, but it's still not working:

Code:
-- Logs begin at Sat 2019-08-31 18:22:58 PDT, end at Sat 2019-08-31 18:24:33 PDT. --
Aug 31 18:23:02 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 18:23:11 cl-01 ceph-mds[1229]: starting mds.cl-01 at
Aug 31 18:24:23 cl-01 systemd[1]: Stopping Ceph metadata server daemon...
Aug 31 18:24:23 cl-01 ceph-mds[1229]: 2019-08-31 18:24:23.324 7fe5ba354700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
Aug 31 18:24:23 cl-01 ceph-mds[1229]: 2019-08-31 18:24:23.324 7fe5ba354700 -1 mds.cl-01 *** got signal Terminated ***
Aug 31 18:24:23 cl-01 ceph-mds[1229]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7fe5bd5a5340 time 2019-08-31 18:24:23.849195
Aug 31 18:24:23 cl-01 ceph-mds[1229]: /root/sources/pve/ceph/ceph-14.2.2/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fe5bee4a5c6]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  2: (()+0x28079e) [0x7fe5bee4a79e]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  3: (()+0x3dcfb3) [0x561fd38d3fb3]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  4: (()+0x39d8c) [0x7fe5bdadad8c]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  5: (()+0x39eba) [0x7fe5bdadaeba]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  6: (__libc_start_main()+0xf2) [0x7fe5bdac50a2]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  7: (_start()+0x2a) [0x561fd36338ca]
Aug 31 18:24:23 cl-01 ceph-mds[1229]: *** Caught signal (Segmentation fault) **
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  in thread 7fe5bd5a5340 thread_name:ceph-mds
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable)
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  1: (()+0x12730) [0x7fe5be213730]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  2: (__pthread_mutex_lock()+0) [0x7fe5be20b6c0]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7fe5bf1bd0b1]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7fe5bee4a6c0]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  5: (()+0x28079e) [0x7fe5bee4a79e]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  6: (()+0x3dcfb3) [0x561fd38d3fb3]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  7: (()+0x39d8c) [0x7fe5bdadad8c]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  8: (()+0x39eba) [0x7fe5bdadaeba]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  9: (__libc_start_main()+0xf2) [0x7fe5bdac50a2]
Aug 31 18:24:23 cl-01 ceph-mds[1229]:  10: (_start()+0x2a) [0x561fd36338ca]
Aug 31 18:24:23 cl-01 systemd[1]: ceph-mds@cl-01.service: Main process exited, code=killed, status=11/SEGV
Aug 31 18:24:23 cl-01 systemd[1]: ceph-mds@cl-01.service: Failed with result 'signal'.
Aug 31 18:24:23 cl-01 systemd[1]: Stopped Ceph metadata server daemon.
Aug 31 18:24:33 cl-01 systemd[1]: Started Ceph metadata server daemon.
Aug 31 18:24:33 cl-01 ceph-mds[2060]: starting mds.cl-01 at
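For reference, removing the option just meant editing the kernel command line and regenerating the GRUB config; a minimal sketch, assuming the stock Debian/Proxmox GRUB layout:

Code:
# /etc/default/grub - drop "nofb" from the default kernel command line, e.g.
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
# then regenerate grub.cfg and reboot for the change to take effect
update-grub
reboot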
 
What ceph services did you install before creating the MDS?

And did you go through our documentation?
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pveceph_fs

Hi there.

I installed Proxmox 6, created a cluster, installed Ceph through the Proxmox GUI, and then created the OSDs through the GUI, along with CephFS and the MDS.

This has to be from the Ceph upgrade, since the prior version (14.2.1 or so) worked fine.

Nothing was changed on Proxmox. It's a default install with two Linux bridges configured. The cluster runs on the public interface, while CephFS is configured to use only the private network.

I installed Fedora 30 and set up Ceph there; it worked fine, so it's not a hardware issue.

I did run into a snag in the installation documentation when creating the MDS, though, and it looks like the same issue I hit with Proxmox. Here is what the install instructions specified:

Code:
ceph auth get-or-create mgr.cl-01 mon 'allow profile mgr' osd 'allow *' mds 'allow *'

This is the code that worked:

Code:
sudo ceph auth get-or-create mds.cl-01 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-cl-01/keyring

My guess is Proxmox might be hitting this issue as well.
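For completeness, the full manual sequence I ended up using on Fedora was roughly the following; this is just a sketch with my MDS id (cl-01) and the stock /var/lib/ceph path, so adjust for your setup:

Code:
# create the MDS data directory and hand it to the ceph user
sudo mkdir -p /var/lib/ceph/mds/ceph-cl-01
# create the MDS key with the usual MDS caps and store it as the daemon keyring
sudo ceph auth get-or-create mds.cl-01 \
    mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' \
    | sudo tee /var/lib/ceph/mds/ceph-cl-01/keyring
sudo chown -R ceph:ceph /var/lib/ceph/mds/ceph-cl-01
# start and enable the daemon
sudo systemctl enable --now ceph-mds@cl-01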
 


there seems to be a problem after all.

i have a 3-node proxmox cluster and installed ceph via the web interface on all nodes.

creating an MDS for cephfs fails on all nodes equally, with the same error.

any hints on what i could do to make this work?


edit: i destroyed the MDS and the cephfs via the CLI as described in your docs, then created a new MDS, which came up with an up:standby status. then i tried to create a cephfs and it timed out waiting for an MDS to become active.
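for reference, the destroy/recreate sequence looked roughly like this; just a sketch (MDS name and fs/pool names from my setup), so double-check the exact subcommands against the pveceph docs:

Code:
# remove the PVE storage entry, then tear down the cephfs and its pools
# (pool deletion requires mon_allow_pool_delete=true)
pvesm remove cephfs_test
ceph fs fail cephfs_test
ceph fs rm cephfs_test --yes-i-really-mean-it
ceph osd pool delete cephfs_test_data cephfs_test_data --yes-i-really-really-mean-it
ceph osd pool delete cephfs_test_metadata cephfs_test_metadata --yes-i-really-really-mean-it
# destroy the MDS on this node, recreate it, then recreate the fs
pveceph mds destroy sonne
pveceph mds create
pveceph fs create --name cephfs_test --add-storage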

log from creating the MDS (already with a segfault): https://pastebin.com/Kw62gMRD

when creating a cephfs, the status of the newly created MDS goes from up:standby to up:creating and hangs.


Code:
creating data pool 'cephfs_test_data'...
creating metadata pool 'cephfs_test_metadata'...
configuring new CephFS 'cephfs_test'
Successfully create CephFS 'cephfs_test'
Adding 'cephfs_test' to storage configuration...
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
Waiting for an MDS to become active
TASK ERROR: Need MDS to add storage, but none got active!


the cephfs does show up despite the failure of the MDS, and the status page for ceph says:

Code:
mdssonne(mds.0): 31 slow metadata IOs are blocked > 30 secs, oldest blocked for 117 secs


i would appreciate any pointers on how to make this work.
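for anyone hitting the same thing, the state i'm describing can be checked from the CLI with the standard status commands:

Code:
# overall health plus the slow metadata IO warning
ceph -s
# which MDS rank is stuck in up:creating
ceph fs status
# size / min_size / crush_rule of the cephfs pools
ceph osd pool ls detail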
 

I found that completely reinstalling from scratch fixed the issue. Only this time I didn't install from the ISO; I installed on top of Debian 10 manually. I'm also not on a fully upgraded Proxmox install, so it could also be one of the updates causing the problem.
 


it seems that debian 10 (buster) ships this version of ceph:
  • buster (stable) (admin): distributed storage and file system
    12.2.11+dfsg1-2.1: all
but IIRC proxmox is on 14.2.2.

i would like to use the newer version because it introduced some features i need.
 

Okay? I'm on 14.2.2. When you install Proxmox on Debian, it uses Proxmox's own repositories for Ceph.
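If you want to verify it, on a PVE 6 node the Ceph packages come from the Proxmox Ceph repository that pveceph install sets up; roughly like this (the repo file name may differ on your install):

Code:
# repository added by 'pveceph install' on Proxmox VE 6 / Debian buster,
# typically in /etc/apt/sources.list.d/ceph.list:
deb http://download.proxmox.com/debian/ceph-nautilus buster main

# confirm which repository ceph-mds is actually installed from
apt policy ceph-mds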
 

i didn't know that. can you please point me in the right direction?

also, it really doesn't negate that proxmox seems to have a built-in fault (at least in the default distribution).
 

Yeah, I mean, I made this thread for a reason. Here's hoping they fix it. :)
 
I ran into this issue as well: the MDS is always in up:creating status, and the cephfs does not get created, failing with the 'Need MDS to add storage, but none got active!' message.

Could you find the cause of the issue?
 
OK, I figured out that the pool replication was not properly sized for my setup (the replication level could not be satisfied), and once min_size was changed the MDS went to active and the cephfs was functional.
 
the original pools used osd_pool_default_min_size = 2 and osd_pool_default_size = 3.

then I changed them to 1/1 (I know this is bad, but this is a test cluster with a single device) and the MDS status went from up:creating to active.
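the change itself amounts to something like this (pool names from the task log earlier in the thread; 1/1 only because it's a throwaway test cluster):

Code:
ceph osd pool set cephfs_test_data size 1
ceph osd pool set cephfs_test_data min_size 1
ceph osd pool set cephfs_test_metadata size 1
ceph osd pool set cephfs_test_metadata min_size 1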
 
I think what happened is that the default CRUSH rule sets the failure domain to `host`. If we only have one node, we should create a new rule with the failure domain set to `osd`, and then change the pools to use the newly created rule.
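A rough sketch of that (the rule name is arbitrary, pool names as in the earlier example):

Code:
# replicated rule with failure domain 'osd' instead of the default 'host'
ceph osd crush rule create-replicated replicated-osd default osd
# point the cephfs pools at the new rule
ceph osd pool set cephfs_test_data crush_rule replicated-osd
ceph osd pool set cephfs_test_metadata crush_rule replicated-osd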
 
