I have a 3-node cluster running Ceph. I upgraded node 3 to Proxmox 7, and it lost network connectivity because of its bonded LACP network settings (solved by this thread: https://forum.proxmox.com/threads/u...ond-lacp-interface-not-working-anymore.92060/). Before I found that solution, I decided to do a fresh install of node 3 with a different IP and name. It later rejoined the cluster without problems.
The problem appeared when I installed Ceph Pacific on node 3. Joining the cluster and adding an OSD went fine, but when I created a monitor on node 3, the whole Ceph network went haywire. Note that nodes 1 and 2 are still on Octopus.
The monitor on node 3 was stopped and could not be started. It also could not be destroyed; the attempt failed with an error saying the monitor does not exist. I assume it became a ghost monitor.
Presumably because of this, all 3 nodes were piling up ceph mon logs in RAM and on the root disk. All 3 nodes crashed because of 100% RAM usage (32GB RAM on each node), and all the root disks were full. I found out later that the ceph mon log exceeded 30GB on disk. I found this out using this command:
Code:
lsof | sort -n -r -k8 | more
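In case it helps anyone hunting for the same kind of runaway log, here is a minimal sketch that lists the biggest files under a directory. The function name `largest_files` and the default path `/var/log/ceph` are my assumptions, not something from the setup above. Unlike `lsof`, it only sees files that still exist on disk, not deleted files that a process is holding open.

```shell
# Sketch: list the largest files under a directory, biggest first.
# The default of /var/log/ceph is an assumption; pass your own path if it differs.
largest_files() {
    dir="${1:-/var/log/ceph}"   # directory to scan
    count="${2:-10}"            # how many entries to print
    # du -k prints sizes in KiB; sort -rn puts the biggest first
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_files /var/log/ceph 5` would print the five largest files in that directory with their sizes in KiB.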
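The only way I could disable the node 3 monitor was by following this thread: https://forum.proxmox.com/threads/ghost-monitor-in-ceph-cluster.58683/
Code:
systemctl disable ceph-mon@pve00-3
systemctl disable ceph-mon@pve00-3.service
Then I removed the directory
Code:
/var/lib/ceph/mon/pve00-3
and re-ran the commands:
Code:
systemctl disable ceph-mon@pve00-3
systemctl disable ceph-mon@pve00-3.service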
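To keep the cleanup order in one place, here is a rough sketch that only prints the commands for a given monitor instead of running them. The function name `ghost_mon_cleanup_plan` is hypothetical, and the `systemctl stop` and `ceph mon remove` lines are my assumptions about what a fuller cleanup might include; only the disable step and the directory removal come from what I actually did. Review the output before running any of it on a live cluster.

```shell
# Sketch: print (not run) a cleanup sequence for a stale monitor.
# Arguments: monitor ID and its data directory, both supplied by the caller.
ghost_mon_cleanup_plan() {
    mon_id="$1"
    mon_dir="$2"
    echo "systemctl stop ceph-mon@${mon_id}.service"     # assumption: stop before disabling
    echo "systemctl disable ceph-mon@${mon_id}.service"  # from the post
    echo "ceph mon remove ${mon_id}"                      # assumption: drop it from the monmap
    echo "rm -r ${mon_dir}"                               # from the post: remove the directory
}
```

For my case the call would be `ghost_mon_cleanup_plan pve00-3 /var/lib/ceph/mon/pve00-3`.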
I later managed to install another node (node 4) with Ceph Octopus on a spare machine. It has no OSDs and exists only to provide the third monitor for quorum.
Would I be able to downgrade node 3 to Octopus so that I can make it a monitor and detach node 4?
I was planning to upgrade nodes 1 and 2 to Pacific, but I decided to wait for Pacific to stabilize, as I have had enough headaches. I read about a cluster crash in this thread, https://forum.proxmox.com/threads/ceph-16-2-pacific-cluster-crash.92367/, which is fixed only in the Pacific test repo at the moment.
Sorry for the messy writing. I am just writing down what happened in case anyone encounters the same problem.
Thanks