ceph tooling fault when creating MDS

kugelzucker

New Member
Sep 22, 2019
Good evening,

I posted in another thread (https://forum.proxmox.com/threads/proxmox-6-ceph-mds-stuck-on-creating.57524/#post-268549) on the same topic, but that thread seems to be dead, so I am trying my luck here to see whether this is a general problem with the Proxmox Ceph implementation or something else. I am aware that I don't have a subscription and that this post might get lower priority because of it.


Fresh Ceph node, no clustering; I pulled all updates after install (no-subscription repo) and then used the web GUI to install Ceph on the node.

Ceph installs fine, I create OSDs and they show up. I can create a pool; all good.

I create an MDS in order to create a CephFS, and the MDS shows up as standby. I then create the CephFS, which hangs during creation with the message that no MDS has responded.
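
For reference, the rough CLI equivalent of what I am doing in the GUI should look something like the sketch below (network, device name, pool name and PG count are placeholders; I actually used the web GUI for every step):

Code:
# initialise the Ceph config and install the packages on the node
pveceph init --network 10.10.10.0/24
pveceph install

# create a monitor, a manager and the OSDs (device name is a placeholder)
pveceph mon create
pveceph mgr create
pveceph osd create /dev/sdX

# create a pool, an MDS, and finally the CephFS
pveceph pool create testpool
pveceph mds create
pveceph fs create --pg_num 128 --add-storage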

Restarting the newly created MDS gives this syslog (taken from the web GUI):

Code:
-- Logs begin at Sat 2019-11-02 23:33:41 CET, end at Sat 2019-11-02 23:47:46 CET. --
Nov 02 23:45:29 mond systemd[1]: Started Ceph metadata server daemon.
Nov 02 23:45:29 mond ceph-mds[9643]: starting mds.mond at
Nov 02 23:47:13 mond ceph-mds[9643]: 2019-11-02 23:47:13.661 7f1eef9df700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
Nov 02 23:47:13 mond ceph-mds[9643]: 2019-11-02 23:47:13.661 7f1eef9df700 -1 mds.mond *** got signal Terminated ***
Nov 02 23:47:13 mond systemd[1]: Stopping Ceph metadata server daemon...
Nov 02 23:47:17 mond ceph-mds[9643]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f1ef2c30340 time 2019-11-02 23:47:17.624970
Nov 02 23:47:17 mond ceph-mds[9643]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Nov 02 23:47:17 mond ceph-mds[9643]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 02 23:47:17 mond ceph-mds[9643]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f1ef44d460e]
Nov 02 23:47:17 mond ceph-mds[9643]:  2: (()+0x2807e6) [0x7f1ef44d47e6]
Nov 02 23:47:17 mond ceph-mds[9643]:  3: (()+0x3dd983) [0x55732615f983]
Nov 02 23:47:17 mond ceph-mds[9643]:  4: (()+0x39d8c) [0x7f1ef3165d8c]
Nov 02 23:47:17 mond ceph-mds[9643]:  5: (()+0x39eba) [0x7f1ef3165eba]
Nov 02 23:47:17 mond ceph-mds[9643]:  6: (__libc_start_main()+0xf2) [0x7f1ef31500a2]
Nov 02 23:47:17 mond ceph-mds[9643]:  7: (_start()+0x2a) [0x557325ebe98a]
Nov 02 23:47:17 mond ceph-mds[9643]: *** Caught signal (Segmentation fault) **
Nov 02 23:47:17 mond ceph-mds[9643]:  in thread 7f1ef2c30340 thread_name:ceph-mds
Nov 02 23:47:17 mond ceph-mds[9643]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 02 23:47:17 mond ceph-mds[9643]:  1: (()+0x12730) [0x7f1ef389e730]
Nov 02 23:47:17 mond ceph-mds[9643]:  2: (__pthread_mutex_lock()+0) [0x7f1ef38966c0]
Nov 02 23:47:17 mond ceph-mds[9643]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7f1ef4848371]
Nov 02 23:47:17 mond ceph-mds[9643]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7f1ef44d4708]
Nov 02 23:47:17 mond ceph-mds[9643]:  5: (()+0x2807e6) [0x7f1ef44d47e6]
Nov 02 23:47:17 mond ceph-mds[9643]:  6: (()+0x3dd983) [0x55732615f983]
Nov 02 23:47:17 mond ceph-mds[9643]:  7: (()+0x39d8c) [0x7f1ef3165d8c]
Nov 02 23:47:17 mond ceph-mds[9643]:  8: (()+0x39eba) [0x7f1ef3165eba]
Nov 02 23:47:17 mond ceph-mds[9643]:  9: (__libc_start_main()+0xf2) [0x7f1ef31500a2]
Nov 02 23:47:17 mond ceph-mds[9643]:  10: (_start()+0x2a) [0x557325ebe98a]
Nov 02 23:47:17 mond systemd[1]: ceph-mds@mond.service: Main process exited, code=killed, status=11/SEGV
Nov 02 23:47:17 mond systemd[1]: ceph-mds@mond.service: Failed with result 'signal'.
Nov 02 23:47:17 mond systemd[1]: Stopped Ceph metadata server daemon.
Nov 02 23:47:45 mond systemd[1]: Started Ceph metadata server daemon.
Nov 02 23:47:46 mond ceph-mds[11092]: starting mds.mond at


I took some screenshots to illustrate: https://imgur.com/a/WpmR7ZW


Please let me know what I am doing wrong, or whether Ceph is broken on the current Proxmox release.
 
What is ceph -s showing?

Fresh Ceph node, no clustering; I pulled all updates after install (no-subscription repo) and then used the web GUI to install Ceph on the node.
And what does "no clustering" mean?
 
Hello Alwin,

Thanks for looking into this.

Code:
root@mond:~# ceph -s
  cluster:
    id:     118d68a0-354b-4bb5-ae81-91cd53f26227
    health: HEALTH_WARN
            Reduced data availability: 96 pgs inactive

  services:
    mon: 1 daemons, quorum mond (age 41s)
    mgr: mond(active, since 24s)
    osd: 4 osds: 4 up (since 32s), 4 in (since 21h); 32 remapped pgs

  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 0 B
    usage:   4.0 GiB used, 6.8 TiB / 6.8 TiB avail
    pgs:     75.000% pgs not active
             96 undersized+peered
             32 active+undersized+remapped

The status right now seems to be off, because I tried to recreate the CephFS a few times and then cleaned up the pools that remained behind from the failed attempts. I can't tell whether this matters or not, but I am trying to clean up the last remaining pool, which oddly doesn't want to go: it just shows the status "checking storage 'transfer' for RBD images.." and then nothing happens. If all else fails I can reimage the server and start from scratch, if that helps.

My main concern is that I am getting a log that reports a crash when restarting the MDS. I really want to understand how this all connects and how I can fix it, so that others won't run into the same problem.

By "no clustering" I mean that I have only one Proxmox node, not several linked together.
 
After wrangling with pveceph to no avail, I removed the pool with ceph osd pool delete without any issues.
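
For anyone finding this later, this is roughly what that manual deletion looks like (pool name 'transfer' as above; as far as I know the monitor has to allow pool deletion first):

Code:
# allow pool deletion (disabled by default)
ceph config set mon mon_allow_pool_delete true

# delete the leftover pool; the name must be given twice as a safeguard
ceph osd pool delete transfer transfer --yes-i-really-really-mean-it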

This is what I got after a while when using the web interface:



Code:
checking storage 'transfer' for RBD images..
failed to remove storage 'transfer': delete storage failed: error with cfs lock 'file-storage_cfg': storage 'transfer' does not exists

TASK ERROR: failed to remove (some) storages - check log and remove manually!
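
Since the task log says to remove the storage manually, I also checked the storage configuration by hand; a sketch of that (assuming the entry were still named 'transfer'):

Code:
# list the configured storages to see whether 'transfer' is still defined
pvesm status
cat /etc/pve/storage.cfg

# remove an orphaned storage definition if it is still listed
pvesm remove transfer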


Now, with no pools, no MDS and no CephFS, this is my status:

Code:
root@mond:~# ceph -s
  cluster:
    id:     118d68a0-354b-4bb5-ae81-91cd53f26227
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum mond (age 11m)
    mgr: mond(active, since 10m)
    osd: 4 osds: 4 up (since 11m), 4 in (since 21h)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   4.0 GiB used, 6.8 TiB / 6.8 TiB avail
    pgs:


This is what I get after creating the MDS via the web interface (status up:standby):
Code:
root@mond:~# ceph -s
  cluster:
    id:     118d68a0-354b-4bb5-ae81-91cd53f26227
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum mond (age 15m)
    mgr: mond(active, since 14m)
    osd: 4 osds: 4 up (since 14m), 4 in (since 21h)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   4.0 GiB used, 6.8 TiB / 6.8 TiB avail
    pgs:



After that I create a CephFS via the web interface, sized for 5 OSDs (4 at the moment, but expected to grow) with 256 PGs as suggested by the online calculator.
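
For what it's worth, the rule of thumb behind such calculators is roughly (OSDs × target PGs per OSD) / replica size, rounded up to the next power of two; with the common target of 100 PGs per OSD and the default replica size of 3 that gives:

Code:
# (5 OSDs * 100 PGs per OSD) / 3 replicas ≈ 167 -> next power of two: 256
echo $(( 5 * 100 / 3 ))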

The creation times out with no MDS responding.

Now I restart the MDS via the web interface and get this again:
Code:
-- Logs begin at Mon 2019-11-04 20:09:16 CET, end at Mon 2019-11-04 20:37:39 CET. --
Nov 04 20:24:15 mond systemd[1]: Started Ceph metadata server daemon.
Nov 04 20:24:15 mond ceph-mds[8130]: starting mds.mond at
Nov 04 20:37:31 mond systemd[1]: Stopping Ceph metadata server daemon...
Nov 04 20:37:31 mond ceph-mds[8130]: 2019-11-04 20:37:31.391 7f8d398a7700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
Nov 04 20:37:31 mond ceph-mds[8130]: 2019-11-04 20:37:31.391 7f8d398a7700 -1 mds.mond *** got signal Terminated ***
Nov 04 20:37:31 mond ceph-mds[8130]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f8d3caf8340 time 2019-11-04 20:37:31.476754
Nov 04 20:37:31 mond ceph-mds[8130]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Nov 04 20:37:31 mond ceph-mds[8130]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 04 20:37:31 mond ceph-mds[8130]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f8d3e39c60e]
Nov 04 20:37:31 mond ceph-mds[8130]:  2: (()+0x2807e6) [0x7f8d3e39c7e6]
Nov 04 20:37:31 mond ceph-mds[8130]:  3: (()+0x3dd983) [0x56000416c983]
Nov 04 20:37:31 mond ceph-mds[8130]:  4: (()+0x39d8c) [0x7f8d3d02dd8c]
Nov 04 20:37:31 mond ceph-mds[8130]:  5: (()+0x39eba) [0x7f8d3d02deba]
Nov 04 20:37:31 mond ceph-mds[8130]:  6: (__libc_start_main()+0xf2) [0x7f8d3d0180a2]
Nov 04 20:37:31 mond ceph-mds[8130]:  7: (_start()+0x2a) [0x560003ecb98a]
Nov 04 20:37:31 mond ceph-mds[8130]: *** Caught signal (Segmentation fault) **
Nov 04 20:37:31 mond ceph-mds[8130]:  in thread 7f8d3caf8340 thread_name:ceph-mds
Nov 04 20:37:31 mond ceph-mds[8130]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 04 20:37:31 mond ceph-mds[8130]:  1: (()+0x12730) [0x7f8d3d766730]
Nov 04 20:37:31 mond ceph-mds[8130]:  2: (__pthread_mutex_lock()+0) [0x7f8d3d75e6c0]
Nov 04 20:37:31 mond ceph-mds[8130]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7f8d3e710371]
Nov 04 20:37:31 mond ceph-mds[8130]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7f8d3e39c708]
Nov 04 20:37:31 mond ceph-mds[8130]:  5: (()+0x2807e6) [0x7f8d3e39c7e6]
Nov 04 20:37:31 mond ceph-mds[8130]:  6: (()+0x3dd983) [0x56000416c983]
Nov 04 20:37:31 mond ceph-mds[8130]:  7: (()+0x39d8c) [0x7f8d3d02dd8c]
Nov 04 20:37:31 mond ceph-mds[8130]:  8: (()+0x39eba) [0x7f8d3d02deba]
Nov 04 20:37:31 mond ceph-mds[8130]:  9: (__libc_start_main()+0xf2) [0x7f8d3d0180a2]
Nov 04 20:37:31 mond ceph-mds[8130]:  10: (_start()+0x2a) [0x560003ecb98a]
Nov 04 20:37:31 mond systemd[1]: ceph-mds@mond.service: Main process exited, code=killed, status=11/SEGV
Nov 04 20:37:31 mond systemd[1]: ceph-mds@mond.service: Failed with result 'signal'.
Nov 04 20:37:31 mond systemd[1]: Stopped Ceph metadata server daemon.
Nov 04 20:37:31 mond systemd[1]: Started Ceph metadata server daemon.
Nov 04 20:37:31 mond ceph-mds[12964]: starting mds.mond at
Nov 04 20:37:37 mond systemd[1]: Stopping Ceph metadata server daemon...
Nov 04 20:37:37 mond ceph-mds[12964]: 2019-11-04 20:37:37.135 7f772acf8700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
Nov 04 20:37:37 mond ceph-mds[12964]: 2019-11-04 20:37:37.135 7f772acf8700 -1 mds.mond *** got signal Terminated ***
Nov 04 20:37:39 mond ceph-mds[12964]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f772df49340 time 2019-11-04 20:37:39.715202
Nov 04 20:37:39 mond ceph-mds[12964]: /mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())
Nov 04 20:37:39 mond ceph-mds[12964]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 04 20:37:39 mond ceph-mds[12964]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f772f7ed60e]
Nov 04 20:37:39 mond ceph-mds[12964]:  2: (()+0x2807e6) [0x7f772f7ed7e6]
Nov 04 20:37:39 mond ceph-mds[12964]:  3: (()+0x3dd983) [0x55872a363983]
Nov 04 20:37:39 mond ceph-mds[12964]:  4: (()+0x39d8c) [0x7f772e47ed8c]
Nov 04 20:37:39 mond ceph-mds[12964]:  5: (()+0x39eba) [0x7f772e47eeba]
Nov 04 20:37:39 mond ceph-mds[12964]:  6: (__libc_start_main()+0xf2) [0x7f772e4690a2]
Nov 04 20:37:39 mond ceph-mds[12964]:  7: (_start()+0x2a) [0x55872a0c298a]
Nov 04 20:37:39 mond ceph-mds[12964]: *** Caught signal (Segmentation fault) **
Nov 04 20:37:39 mond ceph-mds[12964]:  in thread 7f772df49340 thread_name:ceph-mds
Nov 04 20:37:39 mond ceph-mds[12964]:  ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
Nov 04 20:37:39 mond ceph-mds[12964]:  1: (()+0x12730) [0x7f772ebb7730]
Nov 04 20:37:39 mond ceph-mds[12964]:  2: (__pthread_mutex_lock()+0) [0x7f772ebaf6c0]
Nov 04 20:37:39 mond ceph-mds[12964]:  3: (ceph::logging::Log::submit_entry(ceph::logging::Entry&&)+0x41) [0x7f772fb61371]
Nov 04 20:37:39 mond ceph-mds[12964]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x24c) [0x7f772f7ed708]
Nov 04 20:37:39 mond ceph-mds[12964]:  5: (()+0x2807e6) [0x7f772f7ed7e6]
Nov 04 20:37:39 mond ceph-mds[12964]:  6: (()+0x3dd983) [0x55872a363983]
Nov 04 20:37:39 mond ceph-mds[12964]:  7: (()+0x39d8c) [0x7f772e47ed8c]
Nov 04 20:37:39 mond ceph-mds[12964]:  8: (()+0x39eba) [0x7f772e47eeba]
Nov 04 20:37:39 mond ceph-mds[12964]:  9: (__libc_start_main()+0xf2) [0x7f772e4690a2]
Nov 04 20:37:39 mond ceph-mds[12964]:  10: (_start()+0x2a) [0x55872a0c298a]
Nov 04 20:37:39 mond systemd[1]: ceph-mds@mond.service: Main process exited, code=killed, status=11/SEGV
Nov 04 20:37:39 mond systemd[1]: ceph-mds@mond.service: Failed with result 'signal'.
Nov 04 20:37:39 mond systemd[1]: Stopped Ceph metadata server daemon.
Nov 04 20:37:39 mond systemd[1]: Started Ceph metadata server daemon.
Nov 04 20:37:39 mond ceph-mds[13090]: starting mds.mond at

Segfault again.
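
In case more detail helps, this is roughly how I intend to collect further information on the crash (standard commands, nothing Proxmox-specific; <crash-id> is a placeholder):

Code:
# full journal of the MDS unit, including the backtrace above
journalctl -u ceph-mds@mond.service -n 200

# crash reports collected by the Nautilus crash module, if any
ceph crash ls
ceph crash info <crash-id>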

The pools for the CephFS are created, but not active:

Code:
root@mond:~# ceph -s
  cluster:
    id:     118d68a0-354b-4bb5-ae81-91cd53f26227
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 320 pgs inactive
            Degraded data redundancy: 320 pgs undersized

  services:
    mon: 1 daemons, quorum mond (age 30m)
    mgr: mond(active, since 29m)
    mds: cephfs:1 {0=mond=up:creating}
    osd: 4 osds: 4 up (since 30m), 4 in (since 22h)

  data:
    pools:   2 pools, 320 pgs
    objects: 0 objects, 0 B
    usage:   4.0 GiB used, 6.8 TiB / 6.8 TiB avail
    pgs:     100.000% pgs not active
             320 undersized+peered
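
If it is useful for debugging, this is how I would check why the PGs stay inactive (just a sketch with standard Ceph commands; the pool names are what the GUI appears to have created, so treat them as assumptions):

Code:
# which PGs are stuck and in which state
ceph pg dump_stuck inactive

# replica size of the CephFS pools (assumed names); with a single host the
# default CRUSH rule may not be able to place all replicas
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_metadata size

# CRUSH rule in use, to see the failure domain
ceph osd crush rule dump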
 
