I would like to ask you for help because I am running out of ideas on how to solve our issue.
A couple of months ago (just after updating from Proxmox 7.2 to 7.3), we began to receive messages from our Ceph system indicating that there was a "health warn".
The messages we receive are like the following (not always the same monitor):
Code:
[WARN] MON_DOWN: 1/5 mons down, quorum vrt-05,vrt-08,vrt-02,vrt-09
mon.vrt-01 (rank 4) addr [v2:10.61.12.201:3300/0,v1:10.61.12.201:6789/0] is down (out of quorum)
Our cluster usually recovers after a few minutes, and when we run "ceph -s" everything looks correct and the Proxmox system works fine.
But sometimes, after receiving messages like these, we have experienced a reboot of the physical node (Proxmox). This only happens on nodes that are running the monitor service.
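One way to correlate these warnings with the reboots could be to tally which monitor is reported down each time. The following is only a minimal sketch, assuming the warning text always has the format shown above; the function name `count_down_monitors` and the sample input are illustrative and not part of any Ceph or Proxmox tooling.

```python
import re
from collections import Counter

# Matches the detail line of a MON_DOWN warning, for example:
#   mon.vrt-01 (rank 4) addr [v2:...,v1:...] is down (out of quorum)
# The pattern is derived only from the message quoted in this post.
MON_DOWN_RE = re.compile(
    r"mon\.(?P<name>\S+) \(rank (?P<rank>\d+)\) addr \[(?P<addr>[^\]]*)\] is down"
)


def count_down_monitors(lines):
    """Return a Counter mapping monitor name to the number of 'is down' reports."""
    counts = Counter()
    for line in lines:
        match = MON_DOWN_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


if __name__ == "__main__":
    # Sample input copied from the warning above.
    sample = [
        "[WARN] MON_DOWN: 1/5 mons down, quorum vrt-05,vrt-08,vrt-02,vrt-09",
        "mon.vrt-01 (rank 4) addr [v2:10.61.12.201:3300/0,v1:10.61.12.201:6789/0] "
        "is down (out of quorum)",
    ]
    for name, hits in count_down_monitors(sample).items():
        print(f"{name}: reported down {hits} time(s)")
```

Feeding it the collected warning mails or log lines would show whether the outages are spread evenly across the five monitors or concentrated on the nodes that later rebooted.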
Our hyper-converged Proxmox cluster is made up of 12 physical nodes (48 cores / 384 GB RAM per node).
Every physical machine has 4 OSDs (Intel D3-S4610 3.84 TB).
Five nodes are also monitors.
Our Ceph storage is currently at 62% of capacity.
Nodes are currently at 10-20% CPU and 40-50% RAM usage.
All OSDs are currently under 10 ms "Apply/Commit Latency".
Note:
Proxmox PVE: 7.3-3 (enterprise repo)
Ceph: 16.2.9
Kernel: 5.15.74-1-pve
Any help will be appreciated.