Hi gentlemen,
I'm running a tiny cluster with 4 nodes. On Saturday there was a failure and one node started to behave weirdly. To keep Ceph chugging along I decided to delete the monitor from it and just let everything run its course. Unfortunately I deleted the monitor from the wrong node ... and then I deleted the right one. After some patching up on the server, everything is back to nearly normal.
Now, this is where it starts to get strange. At this point Ceph refuses to work at 100% and tells me that the manager does not exist, so I can go and whistle ... OK ... why does the mgr not exist?! I tried to create a manager through the GUI, but the only option is to create it along with a monitor. I hit it, and this is what I get:
Code:
Created symlink /etc/systemd/system/ceph-mon.target.wants/ceph-mon@proxmox-dl180-14bay-2.service -> /lib/systemd/system/ceph-mon@.service.
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
After 10 minutes I gave up ... OK, something is wrong ... hmm, let's try the old "update trick", so I moved from v5.0 to the most recent version (as of yesterday; I think it's 5.1?).
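For completeness, by the "update trick" I mean the usual in-place upgrade (this assumes your Proxmox repositories are already configured correctly; nothing node-specific here):

Code:
apt update
apt dist-upgrade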
Anyway, I tried to create a monitor and I got the same thing ... and the whole cluster became unresponsive. So I deleted those monitors from /etc/pve/ceph.conf and the change was nicely distributed to the other machines ... at least corosync works ... I had to manually remove all remnants of the created monitor (systemctl disable etc.) to make it work.
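In case anyone hits the same thing, the manual cleanup I mean is roughly this (a sketch from memory; substitute your own node name for proxmox-dl180-14bay-2, and note the data path is the default Luminous layout):

Code:
# stop and disable the half-created monitor unit
systemctl stop ceph-mon@proxmox-dl180-14bay-2
systemctl disable ceph-mon@proxmox-dl180-14bay-2
# remove its data directory so a later create starts clean
rm -rf /var/lib/ceph/mon/ceph-proxmox-dl180-14bay-2
# then drop the matching [mon.proxmox-dl180-14bay-2] section from /etc/pve/ceph.conf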
Now the state of my cluster is that there is only one monitor ... strangely, it follows the old naming convention: it's mon.0, not mon.hostname. I tried to create managers with pveceph createmgr and it works perfectly (thankfully; I remember that in the past it was broken).
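For reference, creating the managers was just this, run on each node that should host one (the status line format is from memory, so treat it as approximate):

Code:
pveceph createmgr
ceph -s
# the "services" section should now list something like:
#   mgr: proxmox-dl180-14bay-2(active), standbys: ...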
Now everything works: there are 3 managers and 1 monitor, RBD works, all VMs are humming away.
So I stopped all VMs to avoid causing more damage and tried to see what was happening on the machine that was creating the monitor, and this is what "ps aux | grep ceph" gives:
Code:
ceph 2026 2.0 4.3 2867712 2160148 ? Ssl 00:18 23:26 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph 2160 1.9 4.3 2851180 2135456 ? Ssl 00:18 22:22 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph 2292 1.9 4.2 2822092 2106240 ? Ssl 00:18 21:57 /usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph
ceph 2420 1.8 4.3 2855252 2147712 ? Ssl 00:18 20:30 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph 9779 0.2 0.0 428020 47544 ? Ssl 00:59 3:04 /usr/bin/ceph-mgr -f --cluster ceph --id proxmox-dl180-14bay-2 --setuser ceph --setgroup ceph
root 16129 0.0 0.2 548388 103052 ? Ss 19:10 0:00 task UPID:proxmox-dl180-14bay-2:00003F01:0067A4E1:5A89CFAE:cephcreatemon:mon.proxmox-dl180-14bay-2:root@pam:
root 16166 0.0 0.1 548388 97632 ? S 19:10 0:00 task UPID:proxmox-dl180-14bay-2:00003F01:0067A4E1:5A89CFAE:cephcreatemon:mon.proxmox-dl180-14bay-2:root@pam:
root 16168 0.1 0.0 34692 9804 ? S 19:10 0:00 /usr/bin/python /usr/sbin/ceph-create-keys -i proxmox-dl180-14bay-2
ceph 16169 0.4 0.1 468072 73516 ? Ssl 19:10 0:01 /usr/bin/ceph-mon -f --cluster ceph --id proxmox-dl180-14bay-2 --setuser ceph --setgroup ceph
root 18006 0.6 0.0 566624 18564 ? Sl 19:16 0:00 /usr/bin/rados -p rbd -m 192.168.123.241,192.168.123.242 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/ceph_storage.keyring df
root 18022 0.0 0.0 12788 984 pts/0 S+ 19:16 0:00 grep ceph
As a side note, this installation started out as a Proxmox 5.0 beta, so I know there might be some issues that are a fault of that.