Hyperconverged Ceph cluster and memory management Ceph/KVM

voltage

Hello all,

For one customer I run a Proxmox 5.4 Ceph cluster with 4 nodes. Two of the nodes have 24 GB of memory and 5 OSDs each, and each of these nodes runs only one (important) VM.

On both nodes we see a disturbing pattern:

- after a reboot, all ceph-osd processes start out below 1 GB of memory usage, which would be OK
- over time, the processes grow to about 4 GB of memory usage each
- then out-of-memory conditions occur: sometimes only failed backups, sometimes crashes of the VM because it cannot get all of the memory allocated in its VM configuration

To recap: servers with 24 GB of memory, running only one (1) VM with about 8 GB of main memory, stay stable for only about 1-2 weeks before crashes and out-of-memory situations happen.

Why does the memory usage of the ceph-osd processes grow so much? How can I limit their memory usage and solve this problem?

Both nodes should be able to host 2 VMs (16 GB) and require only 8 GB for the Proxmox system and Ceph.
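In case it is relevant, this is roughly how the memory usage of a single OSD can be inspected (a sketch; osd.0 stands in for any of the local OSD ids):

    ceph daemon osd.0 dump_mempools    # per-mempool accounting, including the bluestore caches
    ps -o rss,cmd -C ceph-osd          # resident set size of each ceph-osd process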

Please help,

Andreas Bauer
 
And I have read up on the documentation, and my reaction was ... o_O

bluestore_cache_autotune
Default: true

osd_memory_target
Default: 4294967296 (4 GiB)

Is this the Proxmox default? Every OSD will shoot for 4 gigabytes of cache?

This of course explains the problem we have. Is it safe to set osd_memory_target to 1G without triggering any other bugs?

These are production machines and the customer is very unhappy about the issues. I cannot afford to cause further downtime over this, as the customer is already shouting at me.
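For completeness, the values actually in effect on a running OSD can be checked via the admin socket, roughly like this (osd.0 is a placeholder for one of the local OSD ids):

    ceph daemon osd.0 config show | grep -E 'osd_memory_target|bluestore_cache_autotune'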
 
Is this the Proxmox default? Every OSD will shoot for 4 gigabytes of cache?
This is the Ceph default.

Is it safe to set osd_memory_target to 1G without triggering any other bugs?
Well, more memory would be better. But if that is not an option, you can reduce the memory target, with correspondingly reduced performance, of course.
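A minimal sketch of how that could look; the 1 GiB value (1073741824 bytes) is only an example and should be sized to the node. Persistently, in /etc/pve/ceph.conf (linked to /etc/ceph/ceph.conf):

    [osd]
         osd_memory_target = 1073741824

And/or at runtime (depending on the Ceph release this may only take full effect after the OSDs have been restarted):

    ceph tell osd.* injectargs '--osd_memory_target 1073741824'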
 
I might suggest that it is wise to lower that default to 1G, as on hyperconverged clusters the memory obviously has to be shared. When migrating from DRBD on pre-existing installations (as in our case), this will bite.

Is there any reason against changing that setting on a running cluster, OSD by OSD? Any other bug that might be triggered? I am wary of doing this outside of a maintenance window because of clear instructions from the customer not to change anything outside these windows.

So we might have to restart the nodes in question every few days until we can change the parameter.

Has anyone here had a problem changing the osd_memory_target node-by-node?
 
In the next maintenance window I will check the performance. What kind of performance would one approximately expect to see from a 4-node, 10 GbE cluster with one 1 TB Samsung 970 NVMe SSD (Bluestore) per host and a 2/1 replication rule?

We see only about 300-400 MB/s with 4k/1M blocksize reads (measured by dd with rbd-nbd).
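The numbers come from a plain sequential read, roughly along these lines (pool, image, and device names here are only an example of the kind of command used):

    rbd-nbd map rbdpool/vm-disk            # prints the mapped device, e.g. /dev/nbd0
    dd if=/dev/nbd0 of=/dev/null bs=1M count=4096 iflag=direct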

That seems a little low anyway?
 
We see only about 300-400 MB/s with 4k/1M blocksize reads (measured by dd with rbd-nbd).
What do you expect? No resources, no performance.

2/1 replication rule
And this is dangerous. Any subsequent failure on an unhealthy cluster will likely result in lost data.
 
What do you expect? No resources, no performance.
No resources? Even without cache, the devices can fetch upwards of 2 GB/s with 128k blocksize. The network does 1.2 GB/s without jumbo frames. With two copies, the theoretical bandwidth should be, worst case (data only on the remote node), 1.2 GB/s on an idle Ceph cluster. Or am I wrong?

So is the difference between 300 MB/s and 1.2 GB/s down to latency? Any tips on how to tune for a bit more speed?

And this is dangerous. Any subsequent failure on an unhealthy cluster will likely result in lost data.
As dangerous as a RAID 1 ;-)

Indeed very dangerous with daily backups. Just joking.

We run 3/2 on spinning metal and cheap SSDs, and 2/1 on the high-performance, high-quality SSDs.
 
So is the difference between 300 MB/s and 1.2 GB/s down to latency? Any tips on how to tune for a bit more speed?
Writes are done synchronously and go to one OSD (the primary); that OSD takes care of the replication to the other copies. Only after all participating OSDs have written the data is the ACK returned to the client. Reads are done in parallel.

For Ceph benchmarks, use rados bench and fio. dd is not a benchmark tool.
https://www.proxmox.com/de/downloads/item/proxmox-ve-ceph-benchmark
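For example, something along these lines; pool and image names are placeholders, and fio needs to be built with the rbd engine:

    # raw cluster throughput: 60 s of 4M writes, then sequential reads of that data
    rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
    rados bench -p testpool 60 seq -t 16
    rados -p testpool cleanup

    # single-job sequential 1M reads against an RBD image, closer to the dd case
    fio --name=seq-read --ioengine=rbd --clientname=admin --pool=testpool \
        --rbdname=testimage --rw=read --bs=1M --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based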

As dangerous as a RAID 1 ;-)
Not really. I know you meant that jokingly, but others read this too. The min_size just tells Ceph at how many copies the pool is still kept in read/write mode. In a degraded state with 2/1, a copy might not be written out to disk yet (in-flight). A subsequent failure on the already stressed disks will likely produce data loss.
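For reference, the pool settings can be checked and changed like this (the pool name is a placeholder):

    ceph osd pool get testpool size        # number of replicas (the "2" in 2/1)
    ceph osd pool get testpool min_size    # copies required to stay read/write (the "1")
    ceph osd pool set testpool size 3
    ceph osd pool set testpool min_size 2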

Indeed very dangerous with daily backups. Just joking.
Not everyone has backups or does regular restore testing of them. ;)
 
Writes are done synchronously and go to one OSD (the primary); that OSD takes care of the replication to the other copies. Only after all participating OSDs have written the data is the ACK returned to the client. Reads are done in parallel.

For Ceph benchmarks, use rados bench and fio. dd is not a benchmark tool.
https://www.proxmox.com/de/downloads/item/proxmox-ve-ceph-benchmark
Reads are done in parallel, so I would expect more than the rate we achieve now. dd is, in my book, a pretty good way to simulate single-threaded sequential reads.

Not really. I know you meant that jokingly, but others read this too. The min_size just tells Ceph at how many copies the pool is still kept in read/write mode. In a degraded state with 2/1, a copy might not be written out to disk yet (in-flight). A subsequent failure on the already stressed disks will likely produce data loss.

Not everyone has backups or does regular restore testing of them. ;)
[In a RAID1] ... In a degraded state ... A subsequent failure on the already stressed disks will likely produce data loss.

Very true. With RAID1, if you are degraded and lose a further disk, you are f****. Same as with Ceph 2/1, which the customer decided is acceptable.

What he does not like is the high number of outages that stem not from hardware failure, but from software (mis)configuration.

Nobody has chimed in: is it safe to change osd_memory_target while the cluster is operational?
 
Reads are done in parallel, so I would expect more than the rate we achieve now.
I assume a bond of 2x 10 GbE. The 1.2 GB/s is more than the limit of a single 10 GbE NIC. Whichever bond mode (besides active-backup) is configured, in a small cluster the likelihood of traffic passing through the same interface is very high.
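Just as an illustration, a typical LACP bond stanza in /etc/network/interfaces looks roughly like this (interface names are placeholders); note that even with a layer3+4 hash policy, a single TCP stream still only ever uses one member link:

    auto bond0
    iface bond0 inet manual
            bond-slaves enp1s0f0 enp1s0f1
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4
            bond-miimon 100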

Very true. With RAID1, if you are degraded and lose a further disk, you are f****. Same as with Ceph 2/1, which the customer decided is acceptable.
Yes. The difference is, though, that with RAID1 no extra data movement occurs for already written data. This is contrary to Ceph, where data is moved due to re-balancing. That was more the point I was aiming at. ;)

Nobody has chimed in: is it safe to change osd_memory_target while the cluster is operational?
Each OSD daemon needs to be restarted, as I don't think it is a live-changeable option. But aside from that, I am not aware of any issue with doing it on a running cluster.
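A rough sketch of how the rolling restart could look, one OSD at a time (OSD id 3 is just an example):

    ceph osd set noout                 # avoid rebalancing during the short restart
    systemctl restart ceph-osd@3       # restart one OSD on the node
    ceph -s                            # wait for HEALTH_OK / all PGs active+clean before the next one
    ceph osd unset noout               # when all OSDs are done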
 
