Reinstall monitor

Hi, I've reinstalled a node following this doc https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster#Re-installing_a_cluster_node but I have a problem with the monitor:

# pveceph createmon
monitor 'mon.int101' already exists

# pveceph destroymon int101
monitor filesystem '/var/lib/ceph/mon/ceph-int101' does not exist on this node


# pvecm status

Quorum information
------------------
Date: Sun Mar 4 21:11:02 2018
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/80
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.101 (local)
0x00000002 1 10.10.10.102
0x00000003 1 10.10.10.103


Is there any way to force the monitor creation? Everything seems to work fine except the Ceph monitor quorum.

Regards
 
In the /etc/pve/ceph.conf you still have the monitor 'int101' configured.
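A minimal cleanup sketch, assuming the leftover entry is a [mon.int101] section and no mon process is left running on the reinstalled node; /etc/pve/ceph.conf lives on the cluster-wide pmxcfs, so editing it once on any node is enough:

# grep -A 2 'mon.int101' /etc/pve/ceph.conf    (show the stale section)
# nano /etc/pve/ceph.conf                      (delete the whole [mon.int101] block)
# pveceph createmon                            (recreate the monitor on the reinstalled node)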
 
Thanks, I've recreated the monitor after deleting it from /etc/pve/ceph.conf.

Regards
 
What if I don't have an entry for the monitor in /etc/pve/ceph.conf? It's already been removed, but I can't re-create the monitor.

I also can't manually delete /var/lib/ceph/mon/ceph-pve3
 
@CastyMcBoozer, well, the MON might still be running if you can't even remove the directory. And did you remove it through 'pveceph mon destroy' or just from the ceph.conf?
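A quick check could look like this (a sketch; the mon id 'pve3' and the unit name are guessed from the directory /var/lib/ceph/mon/ceph-pve3):

# ps aux | grep ceph-mon             (is a mon process still running?)
# systemctl status ceph-mon@pve3     (is the systemd unit still active or enabled?)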
 
@CastyMcBoozer, well, the MON might still be running if you can't even remove the directory. And did you remove it through 'pveceph mon destroy' or just from the ceph.conf?

I originally removed it from the GUI. I'm not sure which service I should be looking for, but:


root@pve3:~# ps aux | grep pve
ceph 1454 0.0 0.3 357504 29712 ? Ssl Apr22 0:18 /usr/bin/ceph-mds -f --cluster ceph --id pve3 --setuser ceph --setgroup ceph
root 1597 0.0 1.0 517172 81836 ? Ss Apr22 2:00 pve-firewall
root 1616 0.1 1.0 515192 82508 ? Ss Apr22 4:51 pvestatd
root 1712 0.0 1.3 556884 110580 ? Ss Apr22 0:01 pvedaemon
root 1717 0.0 1.4 567568 120112 ? S Apr22 0:01 pvedaemon worker
root 1718 0.0 1.4 567604 120124 ? S Apr22 0:01 pvedaemon worker
root 1719 0.0 1.4 566988 119552 ? S Apr22 0:01 pvedaemon worker
root 1873 0.0 1.1 526468 91560 ? Ss Apr22 0:09 pve-ha-crm
www-data 1898 0.0 1.6 564712 131724 ? Ss Apr22 0:02 pveproxy
root 1979 0.0 1.1 526164 91220 ? Ss Apr22 0:19 pve-ha-lrm
www-data 141158 0.0 1.4 567048 119380 ? S Apr23 0:01 pveproxy worker
www-data 141159 0.0 1.4 567048 119380 ? S Apr23 0:01 pveproxy worker
www-data 141160 0.0 1.4 567048 119380 ? S Apr23 0:01 pveproxy worker
root 377870 0.0 0.0 94144 180 ? Ssl 06:25 0:00 /usr/sbin/pvefw-logger
root 417519 0.0 0.0 12784 936 pts/0 S+ 10:26 0:00 grep pve
 
and:


ps aux | grep ceph
ceph 1454 0.0 0.3 357504 29712 ? Ssl Apr22 0:18 /usr/bin/ceph-mds -f --cluster ceph --id pve3 --setuser ceph --setgroup ceph
ceph 1705 0.1 14.7 1917600 1184888 ? Ssl Apr22 4:50 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
root 417832 0.0 0.0 12784 944 pts/0 S+ 10:28 0:00 grep ceph
 
You can find out with 'lsof' what files/directories are open by what process at that location.
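For example (directory name taken from the post above):

# lsof +D /var/lib/ceph/mon/ceph-pve3    (lists every process holding files open under that directory)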
 
And can you delete it now?
 
And can you delete it now?
Are you asking if lsof fixed it? What is this? Quantum mechanics double slit experiment? LOL no nothing has changed. I could just rebuild the node entirely but I kept hearing how stable and production ready Ceph was. I’m not convinced.
 
Are you asking if lsof fixed it? What is this? Quantum mechanics double slit experiment? LOL no nothing has changed.
It's called time between actions (see the timestamps on the posts), during which the state of a running system can definitely change.
Alwin, Yesterday at 07:39
CastyMcBoozer, Yesterday at 19:14
Alwin, Today at 10:38
CastyMcBoozer, 46 minutes ago (Today 14:03)

I could just rebuild the node entirely but I kept hearing how stable and production ready Ceph was. I’m not convinced.
If you don't send any error messages or logfiles, everything is just fishing in the dark.
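The usual starting points on the affected node would be, for example (mon id 'pve3' assumed from the directory name above):

# journalctl -u ceph-mon@pve3                  (systemd journal of the monitor unit)
# tail -n 100 /var/log/ceph/ceph-mon.pve3.log  (Ceph mon log, if it exists)
# ceph -s                                      (overall cluster state)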
 
Hi all!
Looks like I'm having a similar problem:

root@pve03:/etc/ceph# pveceph createmon
monitor 'pve03' already exists

root@pve03:/etc/ceph# pveceph destroymon pve03
no such monitor id 'pve03'

root@pve03:/etc/ceph# cat /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.0.0/24
fsid = e1ee6b28-xxxx-xxxx-xxxx-11d1f6efab9b
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 192.168.0.0/24
ms_bind_ipv4 = true
ms_bind_ipv6 = false
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mon.pve02]
host = pve02
mon addr = 192.168.0.57

root@pve03:/etc/ceph# ll /var/lib/ceph/mon/
total 0

root@pve03:/etc/ceph# ps aux | grep ceph
root 861026 0.0 0.0 17308 9120 ? Ss 19:03 0:00 /usr/bin/python2.7 /usr/bin/ceph-crash
ceph 863641 0.0 0.2 492588 169916 ? Ssl 19:08 0:04 /usr/bin/ceph-mgr -f --cluster ceph --id pve03 --setuser ceph --setgroup ceph
root 890587 0.0 0.0 6072 892 pts/0 S+ 20:43 0:00 grep ceph

root@pve03:~# ceph mon dump
dumped monmap epoch 9
epoch 9
fsid e1ee6b28-xxxx-xxxx-xxxx-11d1f6efab9b
last_changed 2019-10-05 19:07:48.598830
created 2019-05-11 01:28:04.534419
min_mon_release 14 (nautilus)
0: [v2:192.168.0.57:3300/0,v1:192.168.0.57:6789/0] mon.pve02

syslog:
Oct 05 19:57:47 pve03 systemd[1]: Started Ceph cluster monitor daemon.
Oct 05 19:57:47 pve03 ceph-mon[875279]: 2019-10-05 19:57:47.506 7ffb1227f440 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve03' does not exist: have you run 'mkfs'?
Oct 05 19:57:47 pve03 systemd[1]: ceph-mon@pve03.service: Main process exited, code=exited, status=1/FAILURE
Oct 05 19:57:47 pve03 systemd[1]: ceph-mon@pve03.service: Failed with result 'exit-code'.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Service RestartSec=10s expired, scheduling restart.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Scheduled restart job, restart counter is at 4.
Oct 05 19:57:57 pve03 systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Start request repeated too quickly.
Oct 05 19:57:57 pve03 systemd[1]: ceph-mon@pve03.service: Failed with result 'exit-code'.
Oct 05 19:57:57 pve03 systemd[1]: Failed to start Ceph cluster monitor daemon.
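(For reference, the state of the failing unit can be inspected directly; unit name as in the syslog above, just an inspection sketch, not a fix.)

systemctl status ceph-mon@pve03
systemctl is-enabled ceph-mon@pve03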

Is it possible to get around the error? Thanks!
 
@EvilBox, could you please put your posting into a new thread and add the output of pveversion -v to it?
 
