Hi, I have 2 Proxmox/Ceph clusters.
One with 4 OSD nodes (5 disks each), db+wal on NVMe
Another with 4 OSD nodes (10 disks each), db+wal on NVMe
The first cluster was upgraded and performed slowly until all disks were converted to Bluestore. It's still not up to Jewel-level performance, but storage throughput improved. VMs run/feel a bit slower though... The second cluster, however, is just very, very slow, although it does have a much higher load. Both clusters ran just great on Prox4.4/Ceph Jewel.
Performance of the Proxmox nodes degrades over a few hours to the point where even SSH to a node can take minutes. VMs on them are very slow. I suspect it's a kernel tuning issue, but if I knew for sure I wouldn't be posting here.
If I run:
free && sync && echo 3 > /proc/sys/vm/drop_caches && free
It can easily take 30 minutes or even more to finish. Unaffected nodes finish in a few seconds...
We have 40Gb InfiniBand running on the cluster, so I have the kernel tuning parameters shown in the Proxmox/InfiniBand wiki. The same parameters worked just fine on Prox4/Jewel.
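To narrow down which step is slow, it may help to look at how much dirty data is waiting to be written before the sync starts. A minimal sketch, assuming a Linux /proc/meminfo; the helper name dirty_writeback is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the Dirty and Writeback lines (in kB)
# from a meminfo-style file. Defaults to /proc/meminfo.
dirty_writeback() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

# Show the current values on this node.
dirty_writeback
```

Running `time sync` on its own before the drop_caches step would then show whether the flush or the cache drop accounts for most of the wait.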
I tried with/without some of these sysctl.conf tunings as well:
vm.swappiness=0
vm.vfs_cache_pressure=50
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.min_free_kbytes=2097152
vm.zone_reclaim_mode=1
vm.nr_hugepages = 400
But nodes seem to die a slow death over time. After a reboot they are fine for a few hours again.
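Since these were tried both with and without, it may be worth confirming which values are actually live on a slow node versus a healthy one. A rough sketch, assuming the usual /proc/sys layout; sysctl_path is a hypothetical helper name, and zone_reclaim_mode may be absent on kernels built without NUMA support:

```shell
#!/bin/sh
# Hypothetical helper: map a sysctl key such as vm.dirty_ratio
# to its path under /proc/sys.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Print the live value of each vm.* key from the list above.
for key in vm.swappiness vm.vfs_cache_pressure vm.dirty_background_ratio \
           vm.dirty_ratio vm.min_free_kbytes vm.zone_reclaim_mode vm.nr_hugepages; do
    path=$(sysctl_path "$key")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    else
        printf '%s: not available on this kernel\n' "$key"
    fi
done
```

Comparing the output between an affected and an unaffected node would rule out a setting that did not get applied or reverted.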
We get 40%-80% IOWait on slow nodes, which is highly unusual. Before, IOWait barely registered on the summary graphs.
I also tried changing Ceph parameters, among others:
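For anyone wanting to compare nodes outside the summary graphs, the IOWait share can be sampled directly from the first line of /proc/stat. A minimal sketch, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); iowait_pct is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: given two aggregate "cpu ..." lines from /proc/stat,
# print the percentage of CPU time spent in iowait between the two samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6   # iowait is the 5th counter, i.e. awk field 6
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

# Take two samples one second apart and report the iowait share.
a=$(head -n 1 /proc/stat)
sleep 1
b=$(head -n 1 /proc/stat)
iowait_pct "$a" "$b"
```

A longer sleep interval gives a smoother figure; one second is only used here to keep the example quick.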
Disabling debugging
osd_op_threads = 5
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2
rbd cache = true
rbd cache writethrough until flush = true
rbd_op_threads = 2
Can anyone suggest recommended kernel and/or Ceph tunings for 4 OSD nodes, 40 disks, 120TB on Luminous?
Or what other info can I provide? I'm not sure if it's Luminous or Prox5's kernel causing it.
Thanks for reading
Ekkas
				