Hyperconverged Ceph cluster and memory management Ceph/KVM

voltage

Hello all,

For one customer I run a Proxmox 5.4 Ceph cluster with 4 nodes. Two of the nodes have 24 GB of memory and 5 OSDs each, and each of these nodes runs only one (important) VM.

On both nodes we see a disturbing pattern:

- after a reboot, all ceph-osd processes start out below 1 GB of memory usage, which would be OK
- over time, the processes grow to about 4 GB of memory usage each
- then out-of-memory conditions occur: sometimes only failed backups, sometimes crashes of the VM because it cannot get all of the memory allocated in its VM configuration

To recap: servers with 24 GB of memory, running only one (1) VM with about 8 GB of main memory, stay stable for only about 1-2 weeks before crashes and out-of-memory situations happen.

Why does the memory usage of the ceph-osd processes grow so much? How can I limit their memory usage and solve this problem?

Both nodes should be able to host 2 VMs (16 GB) and require only 8 GB for the Proxmox system and Ceph.
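In case it is relevant, this is roughly how the memory usage of a single OSD can be inspected (a sketch; osd.0 stands in for any of the local OSD ids):

    ceph daemon osd.0 dump_mempools    # per-mempool accounting, including the bluestore caches
    ps -o rss,cmd -C ceph-osd          # resident set size of each ceph-osd process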

Please help,

Andreas Bauer
 
And I have read up on the documentation, and my reaction was ... o_O

bluestore_cache_autotune
Default: true

osd_memory_target
Default: 4294967296 (4 GiB)

Is this the Proxmox default? Every OSD will shoot for 4 gigabytes of cache?

This of course explains the problem we have. Is it safe to set osd_memory_target to 1G without triggering any other bugs?

These are production machines and the customer is very unhappy about the issues. I cannot afford to cause further downtime over this, as the customer is already shouting at me.
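For completeness, the values actually in effect on a running OSD can be checked via the admin socket, roughly like this (osd.0 is a placeholder for one of the local OSD ids):

    ceph daemon osd.0 config show | grep -E 'osd_memory_target|bluestore_cache_autotune'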
 
Is this the Proxmox default? Every OSD will shoot for 4 gigabytes of cache?
This is the Ceph default.

Is it safe to set osd_memory_target to 1G without triggering any other bugs?
Well, more memory would be better. But if that is not an option, you can reduce the memory target, with correspondingly reduced performance, of course.
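A minimal sketch of how that could look; the 1 GiB value (1073741824 bytes) is only an example and should be sized to the node. Persistently, in /etc/pve/ceph.conf (linked to /etc/ceph/ceph.conf):

    [osd]
         osd_memory_target = 1073741824

And/or at runtime (depending on the Ceph release this may only take full effect after the OSDs have been restarted):

    ceph tell osd.* injectargs '--osd_memory_target 1073741824'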
 
I might suggest that it is wise to lower that default to 1G, as on hyperconverged clusters the memory obviously has to be shared. When migrating from DRBD on pre-existing installations (as in our case), this will bite.

Is there any reason against changing that setting on a running cluster, OSD by OSD? Any other bug that might be triggered? I am wary of doing this outside of a maintenance window because of clear instructions from the customer not to change anything outside these windows.

So we might have to restart the nodes in question every few days until we can change the parameter.

Has anyone here had a problem changing the osd_memory_target node-by-node?
 
In the next maintenance window I will check the performance. What kind of performance would one approximately expect to see from a 4-node, 10 GbE cluster with one 1 TB Samsung 970 NVMe SSD (Bluestore) per host and a 2/1 replication rule?

We see only about 300-400 MB/s with 4k/1M blocksize reads (measured by dd with rbd-nbd).
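The numbers come from a plain sequential read, roughly along these lines (pool, image, and device names here are only an example of the kind of command used):

    rbd-nbd map rbdpool/vm-disk            # prints the mapped device, e.g. /dev/nbd0
    dd if=/dev/nbd0 of=/dev/null bs=1M count=4096 iflag=direct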

That seems a little low anyway?
 
We see only about 300-400 MB/s with 4k/1M blocksize reads (measured by dd with rbd-nbd).
What do you expect? No resources, no performance.

2/1 replication rule
And this is dangerous. Any subsequent failure on an unhealthy cluster will likely result in lost data.
 
What do you expect? No resources, no performance.
No resources? Even without cache, the devices can fetch upwards of 2 GB/s with 128k blocksize. The network does 1.2 GB/s without jumbo frames. With two copies, the theoretical bandwidth should be, worst case (data only on the remote node), 1.2 GB/s on an idle Ceph cluster. Or am I wrong?

So is the difference between 300 MB/s and 1.2 GB/s down to latency? Any tips on how to tune for a bit more speed?

And this is dangerous. Any subsequent failure on an unhealthy cluster will likely result in lost data.
As dangerous as a RAID 1 ;-)

Indeed very dangerous with daily backups. Just joking.

We run 3/2 on spinning metal and cheap SSDs, and 2/1 on the high-performance, high-quality SSDs.
 
So is the difference between 300 MB/s and 1.2 GB/s down to latency? Any tips on how to tune for a bit more speed?
Writes are done synchronously and go to one OSD (the primary); that OSD takes care of the replication to the other copies. Only after all participating OSDs have written the data is the ACK returned to the client. Reads are done in parallel.

For Ceph benchmarks, use rados bench and fio. dd is not a benchmark tool.
https://www.proxmox.com/de/downloads/item/proxmox-ve-ceph-benchmark
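For example, something along these lines; pool and image names are placeholders, and fio needs to be built with the rbd engine:

    # raw cluster throughput: 60 s of 4M writes, then sequential reads of that data
    rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
    rados bench -p testpool 60 seq -t 16
    rados -p testpool cleanup

    # single-job sequential 1M reads against an RBD image, closer to the dd case
    fio --name=seq-read --ioengine=rbd --clientname=admin --pool=testpool \
        --rbdname=testimage --rw=read --bs=1M --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based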

As dangerous as a RAID 1 ;-)
Not really. I know you meant that jokingly, but others read this too. The min_size just tells Ceph at how many copies the pool is still kept in read/write mode. In a degraded state with 2/1, a copy might not be written out to disk yet (in-flight). A subsequent failure on the already stressed disks will likely produce data loss.
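For reference, the pool settings can be checked and changed like this (the pool name is a placeholder):

    ceph osd pool get testpool size        # number of replicas (the "2" in 2/1)
    ceph osd pool get testpool min_size    # copies required to stay read/write (the "1")
    ceph osd pool set testpool size 3
    ceph osd pool set testpool min_size 2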

Indeed very dangerous with daily backups. Just joking.
Not everyone has backups or does regular restore testing of them. ;)
 
Writes are done synchronously and go to one OSD (the primary); that OSD takes care of the replication to the other copies. Only after all participating OSDs have written the data is the ACK returned to the client. Reads are done in parallel.

For Ceph benchmarks, use rados bench and fio. dd is not a benchmark tool.
https://www.proxmox.com/de/downloads/item/proxmox-ve-ceph-benchmark
Reads are done in parallel, so I would expect more than the rate we achieve now. dd is, in my book, a pretty good way to simulate single-threaded sequential reads.

Not really. I know you meant that jokingly, but others read this too. The min_size just tells Ceph at how many copies the pool is still kept in read/write mode. In a degraded state with 2/1, a copy might not be written out to disk yet (in-flight). A subsequent failure on the already stressed disks will likely produce data loss.

Not everyone has backups or does regular restore testing of them. ;)
[In a RAID1] ... In a degraded state ... A subsequent failure on the already stressed disks will likely produce data loss.

Very true. With RAID1, if you are degraded and lose a further disk, you are f****. Same as with Ceph 2/1, which the customer decided is acceptable.

What he does not like is the high number of outages that stem not from hardware failure, but from software (mis)configuration.

Nobody has chimed in: is it safe to change osd_memory_target while the cluster is operational?
 
Reads are done in parallel, so I would expect more than the rate we achieve now.
I assume a bond of 2x 10 GbE. The 1.2 GB/s is more than the limit of a single 10 GbE NIC. Whichever bond mode (besides active-backup) is configured, in a small cluster the likelihood of traffic passing through the same interface is very high.
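Just as an illustration, a typical LACP bond stanza in /etc/network/interfaces looks roughly like this (interface names are placeholders); note that even with a layer3+4 hash policy, a single TCP stream still only ever uses one member link:

    auto bond0
    iface bond0 inet manual
            bond-slaves enp1s0f0 enp1s0f1
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4
            bond-miimon 100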

Very true. With RAID1, if you are degraded and lose a further disk, you are f****. Same as with Ceph 2/1, which the customer decided is acceptable.
Yes. The difference is, though, that with RAID1 no extra data movement occurs for already written data. This is contrary to Ceph, where data is moved due to re-balancing. That was more the point I was aiming at. ;)

Nobody has chimed in: is it safe to change osd_memory_target while the cluster is operational?
Each OSD daemon needs to be restarted, as I don't think it is a live-changeable option. But aside from that, I am not aware of any issue with doing it on a running cluster.
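A rough sketch of how the rolling restart could look, one OSD at a time (OSD id 3 is just an example):

    ceph osd set noout                 # avoid rebalancing during the short restart
    systemctl restart ceph-osd@3       # restart one OSD on the node
    ceph -s                            # wait for HEALTH_OK / all PGs active+clean before the next one
    ceph osd unset noout               # when all OSDs are done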
 
