Ceph OSDs on Proxmox nodes with VMs - not such a good idea

wahmed

After much testing and disaster simulation I have decided not to put Ceph OSDs on Proxmox nodes with VMs. This is not to be confused with not using Proxmox+Ceph servers together; rather, OSDs should not live on the same Proxmox nodes that serve several virtual machines.


During an OSD failure or OSD addition, when Ceph goes into rebalancing mode, I have noticed between 25% and 35% CPU consumption. If my VMs are already consuming 80% of the CPU, this causes a major slowdown of the VMs. During regular operation, though, the CPU consumption was hardly noticeable. This is not new; the Ceph developers did mention that Ceph will consume a large amount of resources during rebalancing.
As long as all OSDs are on their own nodes without any VMs, all is good. On a 7-node cluster I put all OSDs on nodes 5, 6 & 7, then spread all VMs across nodes 1 to 4. Running the same disaster simulation, the VMs performed much better.
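(As a reference for reproducing this: the rebalance progress and the OSD placement can be watched from any node that holds the Ceph admin keyring using the standard CLI - an illustrative set of commands, exact output differs per release.)

Code:
         ceph -s          # overall health, including recovery/backfill progress and throughput
         ceph -w          # follow the cluster log live while the rebalance runs
         ceph osd tree    # shows which host each OSD lives on (nodes 5, 6 & 7 in this layout)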


Proxmox + Ceph server still shines, because it gives us the ability to monitor/manage Ceph from the same GUI and eliminates the need for a separate node for admin/MONs.


Has anybody had an experience like this, or any suggestions?
 
80% CPU load across all cores is way too high. If you're overcommitting resources like that, I don't find it unreasonable to see this becoming a problem during recovery scenarios.

Ceph has recommended hardware specs, and obviously these don't just disappear when you colocate Ceph daemons with clients ;)


Somewhat unrelated note: I don't think the GUI for Ceph is all that useful for monitoring your environment, because it depends on you looking at it in time (at reasonable intervals). As such, including your Ceph setup in your Icinga system should give you better monitoring. Alternatively, there are great standalone Ceph health monitoring scripts that work without a monitoring environment, or also as a data source for check_mk.
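(As an illustration of such a standalone check - a minimal sketch, not any particular published plugin, assuming the ceph CLI and an admin keyring are available on the monitoring host - a Nagios/Icinga-style exit code can be derived straight from ceph health:)

Code:
         #!/bin/sh
         # Map "ceph health" output to Nagios/Icinga exit codes.
         STATUS=$(ceph health 2>/dev/null)
         case "$STATUS" in
             HEALTH_OK*)   echo "OK - $STATUS";       exit 0 ;;
             HEALTH_WARN*) echo "WARNING - $STATUS";  exit 1 ;;
             HEALTH_ERR*)  echo "CRITICAL - $STATUS"; exit 2 ;;
             *)            echo "UNKNOWN - no answer from ceph"; exit 3 ;;
         esac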

Moreover, Ceph is a system that is designed in such a way that you don't need to monitor it at all. The most likely HEALTH_WARN scenario with Ceph is that your disks are filling up, which shouldn't happen with proper resource planning anyhow. Once you get to a decent number of nodes, you can just ignore dead hard drives and replace them in bulk every other month or so.
 
80% CPU load across all cores is way too high. If you're overcommitting resources like that, I don't find it unreasonable to see this becoming a problem during recovery scenarios.
I agree. The 80% load was picked to simulate those "what if" situations. In a bigger environment with lots of VMs it is easy to overcommit a few nodes.

Ceph has recommended hardware specs, and obviously these don't just disappear when you colocate Ceph daemons with clients ;)
For sure they do not. Ceph's recommended hardware is obviously meant to run Ceph OSDs only. In order to run Proxmox VMs and Ceph OSDs on the same node, one has to roughly double up on those recommendations, which I did not do in my case. Ceph already recommends a somewhat beefed-up node, so doubling that to accommodate Proxmox+Ceph on the same node is very expensive in many environments.


Somewhat unrelated note: I don't think the GUI for Ceph is all that useful for monitoring your environment, because it depends on you looking at it in time (at reasonable intervals). As such, including your Ceph setup in your Icinga system should give you better monitoring. Alternatively, there are great standalone Ceph health monitoring scripts that work without a monitoring environment, or also as a data source for check_mk.
Hmm. This is the 4th time somebody has mentioned Icinga as a monitoring system. I really need to look into that.
Before Proxmox+Ceph, I did all my monitoring through the Ceph CLI. Not too bad, but being able to monitor both Proxmox and Ceph from the same GUI sure beats it. I do agree the Proxmox GUI has limitations for advanced Ceph features such as CRUSH map editing and MDS, but I hope these will be added in future releases.
Can you monitor both a Proxmox cluster and a Ceph cluster from Icinga, even if Proxmox and Ceph are not on the same cluster? How does Icinga compare to other solutions such as Nagios, Observium, etc.?

Moreover, Ceph is a system that is designed in such a way that you don't need to monitor it at all. The most likely HEALTH_WARN scenario with Ceph is that your disks are filling up, which shouldn't happen with proper resource planning anyhow. Once you get to a decent number of nodes, you can just ignore dead hard drives and replace them in bulk every other month or so.
I disagree with the statement about not needing to monitor at all and replacing hard drives in bulk every month, although I do see your point. It is far less time consuming and stress-free to replace an HDD whenever it goes bad; waiting till the end of the month to replace HDDs in bulk will put extra load on the Ceph cluster. Besides, I think a Ceph cluster performs better when there is an equal number of OSDs in all nodes. That's just my observational opinion.
Replacing all HDDs at the end of the month does have a positive side, though: it makes management simpler, since somebody does not have to monitor constantly and replace HDDs throughout the month.
 
Can you monitor both a Proxmox cluster and a Ceph cluster from Icinga, even if Proxmox and Ceph are not on the same cluster? How does Icinga compare to other solutions such as Nagios, Observium, etc.?
Well, Icinga is a fork of Nagios, because Nagios is basically a one-man project where the guy often refused changes and made development extremely slow; hence people moved to Icinga.


Besides, I think a Ceph cluster performs better when there is an equal number of OSDs in all nodes. That's just my observational opinion.
It's also a frequently stated recommendation. However, I was talking more about the bigger picture. Once you get to a stage where you'd have to replace a disk every single week, you may want to reconsider that policy :)

Also, the impact of slightly imbalanced nodes becomes less noticeable to well-nigh invisible once you get past 6 nodes.

The general idea behind this recommendation is to always use disks that are similar in size; otherwise you're severely overtaxing the bigger disks, which just aren't any faster than the smaller ones.
 
During an OSD failure or OSD addition, when Ceph goes into rebalancing mode, I have noticed between 25% and 35% CPU consumption.


Same issue here. I limited the recovery and backfilling threads. Before that, a VM was almost unresponsive during backfill/recovery/PG rebalance. Note: under high Ceph utilisation it is possible that backfill/recovery operations never finish.

Code:
         osd disk threads = 1
         osd max backfills = 1
         osd recovery max active = 1

Regards, Patrick
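(For anyone wondering how to roll these out: the values belong in the [osd] section of ceph.conf on every OSD node, and the backfill/recovery limits can also be pushed into the running daemons without a restart - a sketch, assuming the admin keyring is available on the node:)

Code:
         # push the limits into all running OSDs at once
         ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'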
 
Same issue here. I limited the recovery and backfilling threads. Before that, a VM was almost unresponsive during backfill/recovery/PG rebalance. Note: under high Ceph utilisation it is possible that backfill/recovery operations never finish.

Code:
         osd disk threads = 1
         osd max backfills = 1
         osd recovery max active = 1

Regards, Patrick
I noticed the same problem. While rebalancing, the whole cluster almost dies.

Do you need to execute the above commands on all members?

What rebalancing speeds are you getting with these settings?
 
You are replying to a thread that is almost 10 years old ;)

Things have changed since then. Are you running Ceph Quincy (v17)? If so, the settings for throttling recovery have changed. Check out this page in our PVE wiki.
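(For releases that use the mClock scheduler, such as Quincy, recovery throttling is normally done by selecting an mClock profile rather than the old backfill/recovery options - a minimal sketch, assuming the setting is made cluster-wide via the config database:)

Code:
         # favour client I/O over recovery traffic during rebalancing
         ceph config set osd osd_mclock_profile high_client_ops
         # or favour recovery when client load is low
         ceph config set osd osd_mclock_profile high_recovery_ops
         # check the currently active profile
         ceph config get osd osd_mclock_profile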