[SOLVED] pveceph not creating ceph.conf symlink and ceph mon crashing

mbaldini

Hello.
I don't know whether the two things in the thread title are related, so let me explain my problem.

I have a 3-node PVE 5 cluster and I'm testing Ceph on it. I had a healthy Ceph cluster for a week: each node had a ceph mon + ceph mgr (created with pveceph createmon) and 2 or 3 OSDs (created from the web interface). Ceph runs on its dedicated network together with corosync (I know that's not optimal, but for now it's just for testing Ceph and there is not much load on it).

My pveversion on all three nodes:
Code:
pve-hs-main[0]:~$ pveversion -v
proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.1-1
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-pve1

Yesterday I ran a test and gracefully removed node pve-hs-3 / cluster-3 (migrated every VM/CT, set every OSD out and stopped it, destroyed every OSD, destroyed the mon on that node), powered the node off and removed it from the PVE cluster as per the instructions with pvecm delnode pve-hs-3; a rough sketch of the sequence is below.
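For the record, the commands were roughly the following (OSD IDs are placeholders; the OSD steps were repeated for each OSD on the node):
Code:
# for each OSD on the node being removed
ceph osd out <osd-id>
systemctl stop ceph-osd@<osd-id>.service
pveceph destroyosd <osd-id>
# remove the monitor on that node
pveceph destroymon pve-hs-3
# after powering the node off, from one of the remaining nodes:
pvecm delnode pve-hs-3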

I reinstalled the node from scratch with the Proxmox ISO, using the same IPs and the same hostname, but the installation is completely new. I joined the cluster with
Code:
pvecm add 10.10.10.251 -ring0_addr  cluster-3

My /etc/hosts on each node
Code:
pve-hs-main[0]:~$ cat /etc/hosts
127.0.0.1 localhost.localdomain localhost

192.168.2.251 pve-hs-main.local     pve-hs-main     pvelocalhost
192.168.2.252 pve-hs-2.local        pve-hs-2
192.168.2.253 pve-hs-3.local        pve-hs-3


10.10.10.251 cluster-main.local      cluster-main
10.10.10.252 cluster-2.local         cluster-2
10.10.10.253 cluster-3.local         cluster-3

-cut-
pvelocalhost points to the right IP on each node.

My current pvecm status:
Code:
pve-hs-main[0]:~$ pvecm status
Quorum information
------------------
Date:             Tue Oct  3 08:14:54 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/272
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.251 (local)
0x00000002          1 10.10.10.252
0x00000003          1 10.10.10.253

Code:
pve-hs-main[0]:~$ pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 cluster-main (local)
         2          1 cluster-2
         3          1 cluster-3

I then installed Ceph on the reinstalled node with the usual pveceph install --version luminous. Everything went fine, but I noticed errors when running ceph -s. I discovered that the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf had not been created automatically, which is strange; I created it by hand with ln -s. After that all ceph commands started to work, so I went ahead and created the mon+mgr on this host with pveceph createmon. The resulting ceph.conf looks right:

Code:
[global]
         auth client required = none
         auth cluster required = none
         auth service required = none
         bluestore_block_db_size = 64424509440
         cluster network = 10.10.10.0/24
         fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.10.10.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve-hs-main]
         host = pve-hs-main
         mon addr = 10.10.10.251:6789

[mon.pve-hs-3]
         host = pve-hs-3
         mon addr = 10.10.10.253:6789

[mon.pve-hs-2]
         host = pve-hs-2
         mon addr = 10.10.10.252:6789

After the monitor is created and Ceph starts using it, the process randomly crashes with these messages in syslog:
https://pastebin.com/vebhShDH

The mon then keeps crashing and restarting until I remove it with pveceph destroymon pve-hs-3.

I am now running the Ceph cluster with 2 mons (on the other 2 nodes). I added 3 OSDs on the node that has the mon problem and everything works fine; it's only when I create a mon on that node that I get problems. Before the reinstall, that node ran a ceph mon without issues.

Any ideas? I can reinstall the node from scratch again if needed.
 
Aren't your hosts for ceph called cluster-main, cluster-2, cluster-3? Your ceph.conf uses different names.
 
Node 1 has hostname pve-hs-main on IP 192.168.2.251 (the bridged LAN used by VMs/CTs); it also has IP 10.10.10.251 (the dedicated network for corosync and Ceph), where it's called cluster-main in the hosts file (I posted the content of my /etc/hosts above).
The same applies to node 2 (hostname pve-hs-2, cluster-2 on the dedicated network) and node 3 (hostname pve-hs-3, cluster-3 on the dedicated network). Node 3 is the one having problems with the ceph monitor.

The names in ceph.conf are automatically added by pveceph createmon

Is there something wrong with this setup? The different names come from the guide https://pve.proxmox.com/wiki/Separate_Cluster_Network
 
What is 'ceph -s' showing?

Code:
pve-hs-2[0]:~$ ceph -s
  cluster:
    id:     24d5d6bc-0943-4345-b44e-46c19099004b
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum pve-hs-main,pve-hs-2
    mgr: pve-hs-main(active), standbys: pve-hs-2
    osd: 7 osds: 7 up, 7 in

  data:
    pools:   2 pools, 512 pgs
    objects: 52146 objects, 200 GB
    usage:   956 GB used, 5923 GB / 6879 GB avail
    pgs:     512 active+clean

  io:
    client:   200 kB/s rd, 38629 kB/s wr, 31 op/s rd, 40 op/s wr
I only have 2 mons because the mon on pve-hs-3 starts crashing again as soon as I add it.


This setup is for a separate corosync network and has nothing to do with ceph.
Yes, but Ceph and corosync currently share the same 10.10.10.x network. I know it's not optimal, but I can't change this at the moment. So when I did the pveceph init, I used
Code:
pveceph init --network 10.10.10.0/24

Moreover, before the reinstallation of the third node pve-hs-3, this setup was working with all 3 nodes running a ceph mon and mgr each.
The strange thing (IMHO) is that pveceph install did not automatically create the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf when I installed Ceph on the third node.
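The workaround was simply to create the link by hand, something like this (assuming the standard paths):
Code:
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf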
 
I ran pveceph init on one node only, to create the Ceph cluster. Then, as you say, I ran pveceph createmon on each node. After that I started adding OSDs, and everything worked for about a week (ceph -s reported HEALTH_OK).

To make a test, I removed all OSDs (pveceph destroyosd) and the mon (pveceph destroymon) on node pve-hs-3, removed the node from the PVE cluster (pvecm delnode pve-hs-3), reinstalled it from scratch (simulating a node failure or replacement), added it back to the PVE cluster, installed Ceph with pveceph install --version luminous, and created a mon with pveceph createmon. The monitor process then started to crash, as you can see from the syslog posted here: https://pastebin.com/vebhShDH
I destroyed the monitor, but as soon as I try to recreate it, it starts crashing again and ceph -s shows 1 MON_DOWN.
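For reference, the mon's own output can also be checked with something like this (service and log file names assume the hostname pve-hs-3 and the default Ceph log location):
Code:
journalctl -u ceph-mon@pve-hs-3.service -n 200
tail -f /var/log/ceph/ceph-mon.pve-hs-3.log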

Each node runs PVE 5, apt update && pveupgrade reports "Your System is up-to-date", and
Code:
pve-hs-main[0]:~$ cat /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-luminous stretch main

I think I'll try removing the node and reinstalling it from scratch again, but I fear it will show the same problem.
 
In your ceph.conf you still have three mons; as you only have two mons now, you should remove the stale one. I guess your storage.cfg may also still reference all three mons, which could cause trouble too if you add the third node back and it expects to talk to a monitor on an IP where none is running. A rough example is below.
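Something along these lines should clear a stale mon entry (names taken from this thread, adjust as needed):
Code:
# drop the stale monitor from the monmap
ceph mon remove pve-hs-3
# then delete the [mon.pve-hs-3] section from /etc/pve/ceph.conf
# and check whether the storage config still references old monitor IPs
grep -n monhost /etc/pve/storage.cfg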
 
I posted my ceph.conf as it was with the problematic monitor created; after issuing pveceph destroymon pve-hs-3, I now have:

Code:
pve-hs-main[0]:~$ cat /etc/pve/ceph.conf
[global]
         auth client required = none
         auth cluster required = none
         auth service required = none
         bluestore_block_db_size = 64424509440
         cluster network = 10.10.10.0/24
         fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.10.10.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve-hs-main]
         host = pve-hs-main
         mon addr = 10.10.10.251:6789

[mon.pve-hs-2]
         host = pve-hs-2
         mon addr = 10.10.10.252:6789


Code:
pve-hs-main[0]:~$ cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

dir: USB-MAIN
        path /mnt/USB-BACKUP
        content backup
        is_mountpoint 1
        maxfiles 2
        nodes pve-hs-main
        shared 0

rbd: cephwin_vm
        content images
        krbd 0
        pool cephwin

rbd: cephlinux_vm
        content images
        krbd 0
        pool cephlinux

rbd: cephlinux_ct
        content rootdir
        krbd 1
        pool cephlinux


In fact, ceph -s sees just two monitors:
Code:
pve-hs-main[0]:~$ ceph -s
  cluster:
    id:     24d5d6bc-0943-4345-b44e-46c19099004b
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum pve-hs-main,pve-hs-2
    mgr: pve-hs-main(active), standbys: pve-hs-2
    osd: 7 osds: 7 up, 7 in

  data:
    pools:   2 pools, 512 pgs
    objects: 51571 objects, 198 GB
    usage:   948 GB used, 5931 GB / 6879 GB avail
    pgs:     512 active+clean

  io:
    client:   10916 B/s wr, 0 op/s rd, 1 op/s wr
 
What is the 'pveversion -v' output? And what are the ceph versions?
 
pveversion -v on each node

Code:
pve-hs-main:~$ pveversion -v
proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.1-1
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-pve1

Code:
pve-hs-2:~$ pveversion -v
proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.1-1
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-pve1

Code:
pve-hs-3:/etc/pve$ pveversion -v
proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.1-1
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-pve1


ceph versions
Code:
pve-hs-2:~$ ceph versions
{
    "mon": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 2
    },
    "mgr": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 1,
        "unknown": 1
    },
    "osd": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 7
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 10,
        "unknown": 1
    }
}
 
"mgr": { "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 1, "unknown": 1
There should be two mgrs showing up and no "unknown". Try restarting your manager and check the logs to see if anything shows up; see the example below. Also, is the hardware in good condition (just to rule that out)?
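For example, something like this on the node whose mgr looks odd (node name is a placeholder):
Code:
systemctl restart ceph-mgr@<nodename>.service
journalctl -u ceph-mgr@<nodename>.service -n 100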
 
I restarted the ceph-mgr service on node 2, and now the versions are OK:
Code:
pve-hs-3[0]:~$ ceph versions
{
    "mon": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 2
    },
    "mgr": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 2
    },
    "osd": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 7
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 11
    }
}

I then tried to re-create the monitor and manager on node 3:
Code:
pve-hs-3[0]:~$ pveceph createmon
admin_socket: exception getting command descriptions: [Errno 111] Connection refused
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'
INFO:ceph-create-keys:Key exists already: /etc/ceph/ceph.client.admin.keyring
INFO:ceph-create-keys:Key exists already: /var/lib/ceph/bootstrap-osd/ceph.keyring
INFO:ceph-create-keys:Key exists already: /var/lib/ceph/bootstrap-rgw/ceph.keyring
INFO:ceph-create-keys:Key exists already: /var/lib/ceph/bootstrap-mds/ceph.keyring
INFO:ceph-create-keys:Key exists already: /var/lib/ceph/bootstrap-rbd/ceph.keyring
creating manager directory '/var/lib/ceph/mgr/ceph-pve-hs-3'
creating keys for 'mgr.pve-hs-3'
setting owner for directory
enabling service 'ceph-mgr@pve-hs-3.service'
Created symlink /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@pve-hs-3.service -> /lib/systemd/system/ceph-mgr@.service.
starting service 'ceph-mgr@pve-hs-3.service'

Current ceph versions, 3 monitors and 3 managers:
Code:
pve-hs-3[0]:~$ ceph versions
{
    "mon": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 3
    },
    "mgr": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 3
    },
    "osd": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 7
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)": 13
    }
}

Current ceph -s:
Code:
pve-hs-3[0]:~$ ceph -s
  cluster:
    id:     24d5d6bc-0943-4345-b44e-46c19099004b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-hs-main,pve-hs-2,pve-hs-3
    mgr: pve-hs-main(active), standbys: pve-hs-2, pve-hs-3
    osd: 7 osds: 7 up, 7 in

  data:
    pools:   2 pools, 512 pgs
    objects: 51572 objects, 198 GB
    usage:   948 GB used, 5931 GB / 6879 GB avail
    pgs:     512 active+clean

  io:
    client:   77103 B/s wr, 0 op/s rd, 1 op/s wr

It seems to me that the 3 monitors and 3 managers are up and installed OK. But a few seconds later, the monitor crashes again; here are the relevant syslog entries:

Code:
Oct  3 16:15:03 pve-hs-3 pveceph[4761]: <root@pam> starting task UPID:pve-hs-3:000012A8:00655B25:59D39B67:cephdestroymon:mon.pve-hs-3:root@pam:
Oct  3 16:15:03 pve-hs-3 systemd[1]: Reloading.
Oct  3 16:15:03 pve-hs-3 pveceph[4761]: <root@pam> end task UPID:pve-hs-3:000012A8:00655B25:59D39B67:cephdestroymon:mon.pve-hs-3:root@pam: OK
Oct  3 16:15:08 pve-hs-3 pveceph[4861]: <root@pam> starting task UPID:pve-hs-3:0000130A:00655CF1:59D39B6C:cephdestroymgr:mgr.pve-hs-3:root@pam:
Oct  3 16:15:08 pve-hs-3 pveceph[4874]: ceph manager directory '/var/lib/ceph/mgr/ceph-pve-hs-3' not found
Oct  3 16:15:08 pve-hs-3 pveceph[4861]: <root@pam> end task UPID:pve-hs-3:0000130A:00655CF1:59D39B6C:cephdestroymgr:mgr.pve-hs-3:root@pam: ceph manager directory '/var/lib/ceph/mgr/ceph-pve-hs-3' not found
Oct  3 16:16:28 pve-hs-3 pveceph[6353]: <root@pam> starting task UPID:pve-hs-3:000018D2:00657C27:59D39BBC:cephcreatemon:mon.pve-hs-3:root@pam:
Oct  3 16:16:28 pve-hs-3 systemd[1]: Started Ceph cluster monitor daemon.
Oct  3 16:16:28 pve-hs-3 systemd[1]: Reloading.
Oct  3 16:16:31 pve-hs-3 ceph-mon[6405]: 2017-10-03 16:16:31.621867 7fa2fa639700 -1 mon.pve-hs-3@-1(synchronizing).mgr e63 Failed to load mgr commands: (2) No such file or directory
Oct  3 16:16:38 pve-hs-3 systemd[1]: Reloading.
Oct  3 16:16:38 pve-hs-3 systemd[1]: Started Ceph cluster manager daemon.
Oct  3 16:16:38 pve-hs-3 pveceph[6353]: <root@pam> end task UPID:pve-hs-3:000018D2:00657C27:59D39BBC:cephcreatemon:mon.pve-hs-3:root@pam: OK
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]: *** Caught signal (Aborted) **
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  in thread 7fa2fa639700 thread_name:ms_dispatch
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  1: (()+0x9306d4) [0x55c29a0bb6d4]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  2: (()+0x110c0) [0x7fa303d7a0c0]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  3: (gsignal()+0xcf) [0x7fa3011a0fcf]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  4: (abort()+0x16a) [0x7fa3011a23fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  5: (()+0x407059) [0x55c299b92059]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  6: (OSDMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0xc55) [0x55c299c7ec65]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  7: (OSDMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2c0) [0x55c299c88090]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  8: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x7f8) [0x55c299c301c8]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  9: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x233b) [0x55c299af435b]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa49) [0x55c299afb739]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  11: (Monitor::_ms_dispatch(Message*)+0x6d3) [0x55c299afc7c3]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  12: (Monitor::ms_dispatch(Message*)+0x23) [0x55c299b29563]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  13: (DispatchQueue::entry()+0xeda) [0x55c29a0624fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55c299e0855d]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  15: (()+0x7494) [0x7fa303d70494]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  16: (clone()+0x3f) [0x7fa301256aff]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]: 2017-10-03 16:16:51.853428 7fa2fa639700 -1 *** Caught signal (Aborted) **
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  in thread 7fa2fa639700 thread_name:ms_dispatch
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  1: (()+0x9306d4) [0x55c29a0bb6d4]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  2: (()+0x110c0) [0x7fa303d7a0c0]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  3: (gsignal()+0xcf) [0x7fa3011a0fcf]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  4: (abort()+0x16a) [0x7fa3011a23fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  5: (()+0x407059) [0x55c299b92059]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  6: (OSDMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0xc55) [0x55c299c7ec65]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  7: (OSDMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2c0) [0x55c299c88090]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  8: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x7f8) [0x55c299c301c8]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  9: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x233b) [0x55c299af435b]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa49) [0x55c299afb739]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  11: (Monitor::_ms_dispatch(Message*)+0x6d3) [0x55c299afc7c3]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  12: (Monitor::ms_dispatch(Message*)+0x23) [0x55c299b29563]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  13: (DispatchQueue::entry()+0xeda) [0x55c29a0624fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55c299e0855d]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  15: (()+0x7494) [0x7fa303d70494]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  16: (clone()+0x3f) [0x7fa301256aff]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  -1045> 2017-10-03 16:16:31.621867 7fa2fa639700 -1 mon.pve-hs-3@-1(synchronizing).mgr e63 Failed to load mgr commands: (2) No such file or directory
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:      0> 2017-10-03 16:16:51.853428 7fa2fa639700 -1 *** Caught signal (Aborted) **
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  in thread 7fa2fa639700 thread_name:ms_dispatch
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  1: (()+0x9306d4) [0x55c29a0bb6d4]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  2: (()+0x110c0) [0x7fa303d7a0c0]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  3: (gsignal()+0xcf) [0x7fa3011a0fcf]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  4: (abort()+0x16a) [0x7fa3011a23fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  5: (()+0x407059) [0x55c299b92059]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  6: (OSDMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0xc55) [0x55c299c7ec65]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  7: (OSDMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2c0) [0x55c299c88090]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  8: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x7f8) [0x55c299c301c8]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  9: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x233b) [0x55c299af435b]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa49) [0x55c299afb739]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  11: (Monitor::_ms_dispatch(Message*)+0x6d3) [0x55c299afc7c3]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  12: (Monitor::ms_dispatch(Message*)+0x23) [0x55c299b29563]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  13: (DispatchQueue::entry()+0xeda) [0x55c29a0624fa]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55c299e0855d]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  15: (()+0x7494) [0x7fa303d70494]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  16: (clone()+0x3f) [0x7fa301256aff]
Oct  3 16:16:51 pve-hs-3 ceph-mon[6405]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  -2253> 2017-10-03 16:17:02.410913 7fd76c0faf80 -1 mon.pve-hs-3@-1(probing).mgr e64 Failed to load mgr commands: (2) No such file or directory
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:      0> 2017-10-03 16:18:28.467335 7fd761d72700 -1 *** Caught signal (Aborted) **
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  in thread 7fd761d72700 thread_name:ms_dispatch
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  1: (()+0x9306d4) [0x55f11f2326d4]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  2: (()+0x110c0) [0x7fd76b4b30c0]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  3: (gsignal()+0xcf) [0x7fd7688d9fcf]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  4: (abort()+0x16a) [0x7fd7688db3fa]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  5: (()+0x407059) [0x55f11ed09059]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  6: (OSDMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0xc55) [0x55f11edf5c65]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  7: (OSDMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2c0) [0x55f11edff090]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  8: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x7f8) [0x55f11eda71c8]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  9: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x233b) [0x55f11ec6b35b]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa49) [0x55f11ec72739]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  11: (Monitor::_ms_dispatch(Message*)+0x6d3) [0x55f11ec737c3]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  12: (Monitor::ms_dispatch(Message*)+0x23) [0x55f11eca0563]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  13: (DispatchQueue::entry()+0xeda) [0x55f11f1d94fa]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f11ef7f55d]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  15: (()+0x7494) [0x7fd76b4a9494]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  16: (clone()+0x3f) [0x7fd76898faff]
Oct  3 16:18:28 pve-hs-3 ceph-mon[7060]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct  3 16:18:28 pve-hs-3 systemd[1]: ceph-mon@pve-hs-3.service: Main process exited, code=killed, status=6/ABRT
Oct  3 16:18:28 pve-hs-3 systemd[1]: ceph-mon@pve-hs-3.service: Unit entered failed state.
Oct  3 16:18:28 pve-hs-3 systemd[1]: ceph-mon@pve-hs-3.service: Failed with result 'signal'.
 
If you can rule out a hardware issue, could you test stopping the mgrs and then starting the third mon? If it comes up and stays up, you could then start the mgrs again to see if the problem comes back.
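Roughly like this (service names follow the hostnames; placeholders below):
Code:
# on each node, stop the manager
systemctl stop ceph-mgr@<nodename>.service
# then on the third node, create the monitor without a manager
pveceph createmon --exclude-manager 1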
 
Yesterday, after business hours, I tried what you suggested.
I stopped all managers, then on the third node:
Code:
pveceph createmon --exclude-manager 1
Ceph was unhealthy because no manager was active, but all three monitors were up and in quorum without problems.
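Quorum can be checked with something like the following:
Code:
ceph mon stat
ceph quorum_status --format json-pretty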

After about an hour with Ceph working fine and all monitors OK, I started the managers again on the two nodes and created the manager on the third node with pveceph createmgr. Everything has been fine since then, and Ceph is currently healthy with 3 monitors and 3 managers:

Code:
pve-hs-3[0]:~$ ceph -s
  cluster:
    id:     24d5d6bc-0943-4345-b44e-46c19099004b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-hs-main,pve-hs-2,pve-hs-3
    mgr: pve-hs-2(active), standbys: pve-hs-main, pve-hs-3
    osd: 7 osds: 7 up, 7 in

  data:
    pools:   2 pools, 512 pgs
    objects: 51745 objects, 198 GB
    usage:   950 GB used, 5929 GB / 6879 GB avail
    pgs:     512 active+clean

  io:
    client:   341 B/s rd, 102 kB/s wr, 0 op/s rd, 15 op/s wr


I think it's solved, even if I don't understand why one monitor was crashing because of a manager problem.

Thanks for the help
 
