[SOLVED] Proxmox+Ceph node high IOWait

wahmed

I am seeing high IOwait on all Proxmox+Ceph nodes in my cluster. Any idea what might be causing it, or what to tweak in the Ceph configuration? The IO wait is high across all Proxmox+Ceph nodes, and there are no VMs running on any of the OSD nodes.

Ceph cluster:
Nodes = 6
OSDs = 36 (6 per node)
OSD OP Threads = 4
OSD Disk Threads = 2
Filestore OP Threads = 2
ceph-io.png

Any help from fellow ceph users is greatly appreciated!
 
Re: Proxmox+Ceph node high IOWait

I am seeing high IOwait on all Proxmox+Ceph nodes in my cluster. Any idea what might be causing it, or what to tweak in the Ceph configuration? The IO wait is high across all Proxmox+Ceph nodes, and there are no VMs running on any of the OSD nodes.

Ceph cluster:
Nodes = 6
OSDs = 36 (6 per node)
OSD OP Threads = 4
OSD Disk Threads = 2
Filestore OP Threads = 2
View attachment 2685

Any help from fellow ceph users is greatly appreciated!
Hi,
are there deep-scrubs running?
Due to "OSD Disk Threads = 2" they can eat twice IO as with osd_disk_threads = 1.

Look with atop - perhaps you can better see the bottleneck.

Perhaps many IOs? Have you ceph-dash or similiar running to see easy the performance?
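For example (a quick sketch - output format and option handling differ a bit between Ceph releases):
Code:
# check whether any PGs are currently scrubbing or deep-scrubbing
ceph -s | grep -i scrub
ceph pg dump | grep -i scrubbing

# lower osd_disk_threads on all running OSDs
# (also set it in the [osd] section of ceph.conf to persist across restarts)
ceph tell osd.\* injectargs '--osd_disk_threads 1'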

Udo
 
Re: Proxmox+Ceph node high IOWait

are there deep-scrubs running?
No running scrubs.

Due to "OSD Disk Threads = 2" they can eat twice IO as with osd_disk_threads = 1.
Look with atop - perhaps you can better see the bottleneck.
I now turned it down to 1 thread. Lets see what happens. Atop shows around 520 threads per OSD. Thats lot of threads considering each node got 6 OSDs. Reducing OP Threads will help with affecting performance somewhat?

Perhaps many IOs? Have you ceph-dash or similiar running to see easy the performance?
Yes i religiously use ceph-dash. It shows avg. of 1400 op/sec.
 
Re: Proxmox+Ceph node high IOWait

No running scrubs.


I have now turned it down to 1 thread. Let's see what happens. Atop shows around 520 threads per OSD; that's a lot of threads considering each node has 6 OSDs. Will reducing the OP threads help without hurting performance too much?
Hmm, perhaps... but I think not.
Yes, I religiously use ceph-dash. It shows an average of 1400 op/sec.
1400 op/sec is not too bad - perhaps it's just normal load?!
I have just now - during backup time - a comparable wait on a ceph-node (40%).

Udo
 
Re: Proxmox+Ceph node high IOWait

I'll be curious to see how your research into this turns out.

I have a much smaller Ceph cluster (3 nodes, 6 OSDs per node). My iowait/IO delay usually sits between 0.5% and 1.5%, topping out at about 10% while the VMs are being backed up by Proxmox. My ops are much smaller too, but my cluster, even with 20 VMs, isn't super active most of the time (around 20-500 op/s normally). I'll be moving my last two VMs, which are much more active, over to Proxmox/Ceph within the next week, so I'll be curious whether it will be something I need to worry about. Yours seems high, but for all I know it might be normal? Please post back with what you find!
 
Re: Proxmox+Ceph node high IOWait

I have a much smaller Ceph cluster (3 nodes, 6 OSDs per node). My iowait/IO delay usually sits between 0.5% and 1.5%, topping out at about 10% while the VMs are being backed up by Proxmox. My ops are much smaller too, but my cluster, even with 20 VMs, isn't super active most of the time (around 20-500 op/s normally).
In our cluster we have about 55 VMs right now with a variety of loads such as RDP, Exchange, SQL, etc. I reduced the disk threads to 1 and OP threads to 2. The following screenshot shows when I made the changes and how the graph looks after that:
ceph-io-2.png

I agree - for all we know this could just be normal for Ceph under the load we have. But if there is a way to make it perform better and more efficiently, I am all for it. I am not using SSDs for journals; all journals are co-located on the same disks as the OSDs. The following screenshot shows the rough average IO, read/write and number of PGs in the cluster:
ceph-io-3.png

Udo, you have a higher number of OSDs than I do, and your journals are on SSD. During backup time you get an average of 40% IO wait. How about during day-to-day operation? Do you think SSD journals would reduce the IO wait? I would not usually bother too much about IO wait, but I am also seeing reduced performance and some slowdown, hence digging into it. The real-world scenario is somewhat different from the numerous lab tests I did.
Is the high IO wait due to the single-threaded nature of Ceph? Wasn't a multi-threading feature supposed to be added in Giant? I need to look into it.

I'll be moving my last two VMs, which are much more active, over to Proxmox/Ceph within the next week, so I'll be curious whether it will be something I need to worry about. Yours seems high, but for all I know it might be normal? Please post back with what you find!
Although I am putting a lot of time and effort into tracking down this IO wait/reduced performance issue, I will still use Ceph every time without hesitation. The benefits of a redundant cluster and ease of management far outweigh this minor glitch. I am sure ZFS and Gluster lovers would say the same thing. :)
I am actually using all three in our Proxmox cluster: Proxmox+Ceph as the main storage backend for virtual disk images, and Proxmox+ZFS+Gluster for all the backup nodes. The inclusion of these plugins straight into Proxmox sweetened the deal greatly. We actually got rid of all the FreeNAS+NFS and Napp-It setups we had and replaced them with Proxmox+(Ceph, ZFS, Gluster). It made our management and monitoring of nodes much easier.
 
Re: Proxmox+Ceph node high IOWait

Udo, you have a higher number of OSDs than I do, and your journals are on SSD. During backup time you get an average of 40% IO wait. How about during day-to-day operation? Do you think SSD journals would reduce the IO wait? I would not usually bother too much about IO wait, but I am also seeing reduced performance and some slowdown, hence digging into it. The real-world scenario is somewhat different from the numerous lab tests I did.
Is the high IO wait due to the single-threaded nature of Ceph? Wasn't a multi-threading feature supposed to be added in Giant? I need to look into it.
Hi Wasim,
I will look at the IO-wait tomorrow during working hours.

One thing about your IOPS: many small reads/writes. It looks like you don't have rbd cache enabled?
rbd cache combines many small IOs into fewer, bigger IOs.
It should look like this:
Code:
[client]
rbd cache = true
rbd cache writethrough until flush = true
This is the default in newer Ceph versions (which version are you running?).
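Checking the running version is quick (a sketch - ceph -v shows the locally installed version, and a daemon can be asked directly via its admin socket):
Code:
ceph -v
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version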

Udo
 
Re: Proxmox+Ceph node high IOWait

One thing about your IOPS: many small reads/writes. It looks like you don't have rbd cache enabled?
rbd cache combines many small IOs into fewer, bigger IOs.
Very good point!
I completely forgot that I disabled the rbd cache 2 months ago for something and then forgot to enable it again. I have now set both options back to true, and IOPS seems to be averaging around 850. IOwait also dropped on all nodes, at least for a short time. Below is a screenshot of the IO wait and server load right after turning on rbd cache, then slowly going back up again:
ceph-io-4.png

I don't see much change in the IOPS though; still averaging around 800-850, down from the previous ~1400. Increasing the disk threads to 4 and OP threads to 8 should increase performance, no? The nodes can take it; they are all Xeon E3s with 32 GB RAM. I'm just not sure how this will affect day-to-day performance, positively or negatively.


This is the default in newer Ceph versions (which version are you running?).
I am currently on ceph 0.80.9.

I also use CephFS on the same cluster, with filestore threads = 2.
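If you do experiment with higher thread counts, the usual place is the [osd] section of ceph.conf plus injectargs for the already-running daemons - a rough sketch with the values mentioned above; whether it helps at all depends on what the spinning disks behind the OSDs can deliver:
Code:
# /etc/ceph/ceph.conf
[osd]
    osd op threads = 8
    osd disk threads = 4
    filestore op threads = 2

# apply to the running OSDs without a restart
# (the filestore setting may only take effect after an OSD restart)
ceph tell osd.\* injectargs '--osd_op_threads 8 --osd_disk_threads 4'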
 
Re: Proxmox+Ceph node high IOWait

Very good point!
I completely forgot that I disabled the rbd cache 2 months ago for something and then forgot to enable it again. I have now set both options back to true, and IOPS seems to be averaging around 850. IOwait also dropped on all nodes, at least for a short time. Below is a screenshot of the IO wait and server load right after turning on rbd cache, then slowly going back up again:
View attachment 2690

I don't see much change in the IOPS though; still averaging around 800-850, down from the previous ~1400. Increasing the disk threads to 4 and OP threads to 8 should increase performance, no? The nodes can take it; they are all Xeon E3s with 32 GB RAM. I'm just not sure how this will affect day-to-day performance, positively or negatively.



I am currently on ceph 0.80.9.

I also use CephFS on the same cluster, with filestore threads = 2.
Hi Wasim,
my high iowait value must have been a short spike!

I put CPU values into the monitoring, and the result is a small iowait (average 4.5%) - it's not easy to estimate how much benefit comes from the SSD journal (but I think a lot).
iowait_ceph07.png
I have no experience with CephFS! Can you say how much IO comes from CephFS?

My Ceph version is 0.94.1.

Udo
 
Re: Proxmox+Ceph node high IOWait

I did some playing around with threads. I changed the disk threads to 8 and OP threads to 12. The following screenshot was taken after the cluster had run for several hours:
ceph-io-5.png
IOWait does not seem to change that much even with the higher thread counts. Atop shows the thread count per OSD is still about the same, around 550, before and after the thread changes.

I put CPU values into the monitoring, and the result is a small iowait (average 4.5%) - it's not easy to estimate how much benefit comes from the SSD journal (but I think a lot).
Did you happen to check the thread settings for your OSDs?


I have no experience with CephFS! Can you say how much IO comes from CephFS?
My Ceph version is 0.94.1.
I am not sure how to check CephFS IO. I only use CephFS to store ISO templates and to install the occasional VZ container for testing, so CephFS is not actively used on a regular basis.
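If you ever need a rough idea of the MDS side, the admin socket might help (a sketch - mds.a is just a placeholder for your MDS id):
Code:
ceph mds stat
ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perf dump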
 
Re: Proxmox+Ceph node high IOWait

I did some playing around with threads. I changed the disk threads to 8 and OP threads to 12. The following screenshot was taken after the cluster had run for several hours:
View attachment 2693
IOWait does not seem to change that much even with the higher thread counts. Atop shows the thread count per OSD is still about the same, around 550, before and after the thread changes.


Did you happen to check the thread settings for your OSDs?
Hi,
sure!
Code:
root@ceph-01:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep thread
    "xio_portal_threads": "2",
    "ms_rwthread_stack_bytes": "1048576",
    "ms_async_op_threads": "2",
    "fuse_multithreaded": "true",
    "osd_op_threads": "4",
    "osd_disk_threads": "4",
    "osd_disk_thread_ioprio_class": "idle",
    "osd_disk_thread_ioprio_priority": "7",
    "osd_recovery_threads": "1",
    "osd_op_num_threads_per_shard": "2",
    "osd_op_thread_timeout": "15",
    "osd_recovery_thread_timeout": "30",
    "osd_snap_trim_thread_timeout": "3600",
    "osd_scrub_thread_timeout": "60",
    "osd_scrub_finalize_thread_timeout": "600",
    "osd_remove_thread_timeout": "3600",
    "osd_command_thread_timeout": "600",
    "threadpool_default_timeout": "60",
    "threadpool_empty_queue_max_wait": "2",
    "filestore_op_threads": "4",
    "filestore_op_thread_timeout": "60",
    "filestore_op_thread_suicide_timeout": "180",
    "keyvaluestore_op_threads": "2",
    "keyvaluestore_op_thread_timeout": "60",
    "keyvaluestore_op_thread_suicide_timeout": "180",
    "rgw_op_thread_timeout": "600",
    "rgw_op_thread_suicide_timeout": "0",
    "rgw_thread_pool_size": "100",
    "internal_safe_to_start_threads": "true"

Still many IOs with low throughput?
Is rbd_cache active?
Code:
root@ceph-01:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep rbd_cache
    "rbd_cache": "true",
    "rbd_cache_writethrough_until_flush": "true",
    "rbd_cache_size": "33554432",
    "rbd_cache_max_dirty": "25165824",
    "rbd_cache_target_dirty": "16777216",
    "rbd_cache_max_dirty_age": "1",
    "rbd_cache_max_dirty_object": "0",
    "rbd_cache_block_writes_upfront": "false",
And AFAIK the VMs need to be stopped/started for rbd_cache changes to take effect.
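For example, from the Proxmox CLI (a sketch - 101 is just a placeholder VMID; a reboot inside the guest is not enough, the KVM process itself has to be restarted):
Code:
qm shutdown 101 && qm start 101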

Udo
 
Re: Proxmox+Ceph node high IOWait

Hi,
most of my settings are default.

Here is the diff:
Code:
root@ceph-01:/home/ceph# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config diff
{
    "diff": {
        "current": {
            "auth_client_required": "cephx",
            "auth_supported": "cephx",
            "cluster_addr": "192.168.X.11:0\/0",
            "cluster_network": "192.168.X.0\/24",
            "filestore_max_sync_interval": "10",
            "filestore_op_threads": "4",
            "fsid": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
            "internal_safe_to_start_threads": "true",
            "keyring": "\/var\/lib\/ceph\/osd\/ceph-0\/keyring",
            "leveldb_log": "",
            "log_to_stderr": "false",
            "osd_backfill_full_ratio": "0.9",
            "osd_disk_thread_ioprio_class": "idle",
            "osd_disk_thread_ioprio_priority": "7",
            "osd_disk_threads": "4",
            "osd_enable_op_tracker": "false",
            "osd_journal": "\/dev\/disk\/by-partlabel\/journal-0",
            "osd_journal_size": "10000",
            "osd_max_backfills": "1",
            "osd_op_threads": "4",
            "osd_pool_default_min_size": "1",
            "osd_recovery_max_active": "1",
            "osd_scrub_load_threshold": "2.5",
            "pid_file": "\/var\/run\/ceph\/osd.0.pid",
            "public_addr": "172.20.X.11:0\/0",
            "public_network": "172.20.X.0\/24"
        }
Udo
 
Re: Proxmox+Ceph node high IOWait

I decided to reboot all the Proxmox+Ceph nodes to see if that changes anything. Below is a screenshot of one of the nodes after the reboot; the same behavior shows on all Ceph nodes. IOWait went down and stayed down even though no VM changes occurred:
ceph-io-6.png

Current ceph configuration:
Rbd cache = true
Rbd cache writethrough until flush = true
Disk threads = 4
OP threads = 4

I'm not quite sure why, but it does not feel as sluggish as it did before the reboot. I also upgraded all the Proxmox nodes. I will monitor for the next week straight and see what happens.
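For reference, these settings would sit in ceph.conf roughly like this (a sketch using the option names from earlier in this thread):
Code:
[client]
    rbd cache = true
    rbd cache writethrough until flush = true

[osd]
    osd disk threads = 4
    osd op threads = 4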
 
Re: Proxmox+Ceph node high IOWait

Just an observation: could it be that rbd cache = true is only picked up by new instances, i.e. it only becomes effective after the RBD client (the VM) is restarted?
 
[SOLVED] Re: Proxmox+Ceph node high IOWait

It appears that I have solved this issue. Those gaps in the Stats were happening due to an embarrassing fault of my own.

The RBD storage is connected to Proxmox with 5 MONs. In the last couple of months I have rearranged quite a few nodes by rejoining them to Proxmox with different hostnames and IPs. After looking at storage.cfg I realized that only 1 of the original MON entries was still valid; all the other nodes no longer exist. So Proxmox was connecting to the RBD through that one node while endlessly retrying the others. As soon as I changed the IP addresses to the MONs actually in operation, I no longer see the gaps.
I only went into storage.cfg to figure out whether I could forcefully detach and attach a new Gluster node with the same storage ID.
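For anyone hitting the same thing: the stale MON addresses live in the monhost line of /etc/pve/storage.cfg. A rough sketch with placeholder values (ceph-rbd is just an example storage ID; depending on the PVE version the MON list is space- or semicolon-separated):
Code:
rbd: ceph-rbd
        monhost 172.20.X.11 172.20.X.12 172.20.X.13 172.20.X.14 172.20.X.15
        pool rbd
        content images
        username admin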

Thanks to all who tried to help! As always you guys are priceless!
 
Just to add an additional update to this solved issue:
In addition to the gaps in the GUI Stats graphs, it appears the problem was also causing interruptions in Linux servers, specifically CentOS VMs. VMs were going into kernel panics - out of sync, out of memory - which caused a total freeze of the VM. After solving the storage issue, none of the VMs have panicked.
 
