Reinstall/remove dead monitor

EvilBox

Member
May 10, 2019
Hi community!
I upgraded Ceph from 12 (Luminous) to 14 (Nautilus) following this document - https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus - but now I have a problem with a monitor:

Bash:
root@pve03:/etc/ceph# pveceph createmon
monitor 'pve03' already exists

Bash:
root@pve03:/etc/ceph# pveceph destroymon pve03
no such monitor id 'pve03'
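One hedged reading of this pair of errors: pveceph createmon may be refusing because leftover monitor state (for example a still-enabled ceph-mon@pve03 systemd unit) remains on the node, while pveceph destroymon finds no [mon.pve03] section in the config shown below, so neither command succeeds. A minimal sketch of inspecting that leftover state before retrying; the unit name and paths are assumptions based on standard Ceph naming:
Bash:
# Hedged sketch: look for stale monitor state on pve03 before retrying pveceph createmon.
systemctl status ceph-mon@pve03.service   # is a leftover unit still enabled or failing?
systemctl stop ceph-mon@pve03.service
systemctl disable ceph-mon@pve03.service
ls -l /var/lib/ceph/mon/                  # any leftover ceph-pve03 directory?
ceph mon dump                             # does the cluster itself still list pve03?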

Bash:
root@pve03:/etc/ceph# cat /etc/pve/ceph.conf
[global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 192.168.0.0/24
    fsid = e1ee6b28-xxxx-xxxx-xxxx-11d1f6efab9b
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 3
    public network = 192.168.0.0/24
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring
[mon.pve02]
    host = pve02
    mon addr = 192.168.0.57

Bash:
root@pve03:/etc/ceph# ll /var/lib/ceph/mon/
total 0

Bash:
root@pve03:/etc/ceph# ps aux | grep ceph
root 861026 0.0 0.0 17308 9120 ? Ss 19:03 0:00 /usr/bin/python2.7 /usr/bin/ceph-crash
ceph 863641 0.0 0.2 492588 169916 ? Ssl 19:08 0:04 /usr/bin/ceph-mgr -f --cluster ceph --id pve03 --setuser ceph --setgroup ceph
root 890587 0.0 0.0 6072 892 pts/0 S+ 20:43 0:00 grep ceph

Bash:
root@pve03:~# ceph mon dump
dumped monmap epoch 9
epoch 9
fsid e1ee6b28-xxxx-xxxx-xxxx-11d1f6efab9b
last_changed 2019-10-05 19:07:48.598830
created 2019-05-11 01:28:04.534419
min_mon_release 14 (nautilus)
0: [v2:192.168.0.57:3300/0,v1:192.168.0.57:6789/0] mon.pve02
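With only mon.pve02 left in the monmap, quorum now depends entirely on that single monitor. As a hedged sanity check, the standard Ceph CLI can confirm which monitors the cluster knows about and whether they are in quorum:
Bash:
# Standard Ceph status commands; no changes are made to the cluster.
ceph mon stat
ceph quorum_status --format json-pretty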

Bash:
root@pve03:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Code:
syslog:
Oct 05 19:57:47 pve03 systemd[1]: Started Ceph cluster monitor daemon.
Oct 05 19:57:47 pve03 ceph-mon[875279]: 2019-10-05 19:57:47.506 7ffb1227f440 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve03' does not exist: have you run 'mkfs'?
Oct 05 19:57:47 pve03 systemd[1]: ceph-mon@pve03.service: Main process exited, code=exited, status=1/FAILURE
Oct 05 19:57:47 pve03 systemd[1]: ceph-mon@pve03.service: Failed with result 'exit-code'.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Service RestartSec=10s expired, scheduling restart.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Scheduled restart job, restart counter is at 4.
Oct 05 19:57:57 pve03 systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Start request repeated too quickly.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Failed with result 'exit-code'.
Oct 05 19:57:57 pve03 systemd[1]: Failed to start Ceph cluster monitor daemon.
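The syslog shows two separate problems: the monitor data directory /var/lib/ceph/mon/ceph-pve03 is missing ("have you run 'mkfs'?"), and systemd has stopped retrying the unit ("Start request repeated too quickly"). A hedged sketch of clearing the failed unit state once the data directory exists again; the unit name is taken from the log lines above:
Bash:
# Only run this after /var/lib/ceph/mon/ceph-pve03 has been (re)created:
ls -ld /var/lib/ceph/mon/ceph-pve03
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve03
systemctl reset-failed ceph-mon@pve03.service   # clear the "repeated too quickly" state
systemctl restart ceph-mon@pve03.service
journalctl -u ceph-mon@pve03.service -n 20      # confirm it stays up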

Is there a way to work around this error? Thanks!
 
This solution does not work for me:
Bash:
root@pve05:~# ceph-mon -i pve05 --extract-monmap tmp/map
2019-10-07 16:50:42.590 7f647b8ce440 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve05' is empty: have you run 'mkfs'?

I'm trying to add the monitor manually:
Bash:
ceph auth get mon. -o tmp/key
ceph mon getmap -o tmp/map
ceph-mon -i mon.pve05 --mkfs --monmap tmp/map --keyring tmp/key
and then remove it:
Bash:
ceph mon remove mon.pve05
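For reference, ceph-mon -i expects the bare monitor ID (pve05), not the mon.-prefixed name, so the mkfs above likely populated the wrong directory. A hedged, corrected sketch of the manual bootstrap, assuming a surviving monitor is still reachable to serve the monmap and the mon. key:
Bash:
# Corrected sketch of the manual monitor bootstrap (the /tmp paths are arbitrary):
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
mkdir -p /var/lib/ceph/mon/ceph-pve05
ceph-mon -i pve05 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve05
systemctl reset-failed ceph-mon@pve05.service
systemctl start ceph-mon@pve05.service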

After a reboot, Ceph on the node did not come back up. It looks like it has died completely.
Bash:
root@pve05:~# ceph -s
Cluster connection aborted
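"Cluster connection aborted" generally means the client cannot reach any monitor at all. A hedged first check, assuming the surviving monitor should still be listening on 192.168.0.57 (ports 3300/6789 from the monmap above):
Bash:
# Run on pve02: is anything answering on the monitor ports?
ss -tlnp | grep -E '3300|6789'
# Fail fast instead of hanging indefinitely:
ceph -s --connect-timeout 10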
Bash:
root@pve05:~# systemctl status ceph-crash.service ceph-fuse.target ceph-mds.target ceph-mgr.target ceph-mon.target ceph-osd.target ceph.target
● ceph-crash.service - Ceph crash dump collector
   Loaded: loaded (/lib/systemd/system/ceph-crash.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-10-07 16:51:50 MSK; 34min ago
 Main PID: 2616543 (ceph-crash)
    Tasks: 1 (limit: 4915)
   Memory: 5.0M
   CGroup: /system.slice/ceph-crash.service
           └─2616543 /usr/bin/python2.7 /usr/bin/ceph-crash

Oct 07 16:51:50 pve05 systemd[1]: Started Ceph crash dump collector.
Oct 07 16:51:50 pve05 ceph-crash[2616543]: INFO:__main__:monitoring path /var/lib/ceph/crash, delay 600s

● ceph-fuse.target - ceph target allowing to start/stop all ceph-fuse@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph-fuse.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:50 MSK; 34min ago

Oct 07 16:51:50 pve05 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Oct 07 16:51:50 pve05 systemd[1]: Stopping ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Oct 07 16:51:50 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph-fuse@.service instances at once.

● ceph-mds.target - ceph target allowing to start/stop all ceph-mds@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph-mds.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:50 MSK; 34min ago

Oct 07 16:51:50 pve05 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-mds@.service instances at once.
Oct 07 16:51:50 pve05 systemd[1]: Stopping ceph target allowing to start/stop all ceph-mds@.service instances at once.
Oct 07 16:51:50 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.

● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:50 MSK; 34min ago

Oct 07 16:51:50 pve05 systemd[1]: Stopping ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Oct 07 16:51:50 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.

● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:50 MSK; 34min ago

Oct 07 16:51:50 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.

● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph-osd.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:51 MSK; 34min ago

Oct 07 16:51:51 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.

● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Mon 2019-10-07 16:51:51 MSK; 34min ago

Oct 07 16:51:51 pve05 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
No new entries appear in the logs:
Bash:
root@pve05:~# date
Mon 07 Oct 2019 05:15:07 PM MSK
Bash:
root@pve05:~# tail -n 11 /var/log/ceph/ceph.log
2019-10-07 14:15:26.353911 mgr.pve01 (mgr.1934238) 77718 : cluster [DBG] pgmap v77741: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:28.354646 mgr.pve01 (mgr.1934238) 77719 : cluster [DBG] pgmap v77742: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:30.355225 mgr.pve01 (mgr.1934238) 77720 : cluster [DBG] pgmap v77743: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:32.356000 mgr.pve01 (mgr.1934238) 77721 : cluster [DBG] pgmap v77744: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:34.356471 mgr.pve01 (mgr.1934238) 77722 : cluster [DBG] pgmap v77745: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:36.357053 mgr.pve01 (mgr.1934238) 77723 : cluster [DBG] pgmap v77746: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:38.357720 mgr.pve01 (mgr.1934238) 77724 : cluster [DBG] pgmap v77747: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:40.358319 mgr.pve01 (mgr.1934238) 77725 : cluster [DBG] pgmap v77748: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:42.358999 mgr.pve01 (mgr.1934238) 77726 : cluster [DBG] pgmap v77749: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:44.359448 mgr.pve01 (mgr.1934238) 77727 : cluster [DBG] pgmap v77750: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:46.360119 mgr.pve01 (mgr.1934238) 77728 : cluster [DBG] pgmap v77751: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
2019-10-07 14:15:48.360842 mgr.pve01 (mgr.1934238) 77729 : cluster [DBG] pgmap v77752: 128 pgs: 128 active+clean; 0 B data, 335 GiB used, 4.9 TiB / 5.2 TiB avail
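The cluster log stopping at 14:15 while the local time is already 17:15 suggests that no manager (and possibly no monitor) has written to it for hours. A hedged sketch of checking the mon and mgr daemons on each node; the instance names are assumed from the standard ceph-mon@/ceph-mgr@ unit templates:
Bash:
# Run on every Ceph node, substituting that node's own hostname:
systemctl status 'ceph-mon@*' 'ceph-mgr@*'
journalctl -u "ceph-mon@$(hostname).service" -n 50 --no-pager
journalctl -u "ceph-mgr@$(hostname).service" -n 50 --no-pager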
 
root@pve03:/etc/ceph# pveceph createmon
Why did you want to create a MON after the upgrade?

Can you please post the output of systemctl status ceph-mon*?
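For completeness, a hedged version of that command with the glob quoted so the shell does not expand it against local file names:
Bash:
systemctl status 'ceph-mon*' --no-pager --full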