Ceph: sudden slow ops, freezes, and slow-downs

fitbrian · Jun 20, 2022

Hello,

I would like to ask you for help because I am running out of ideas on how to solve our issue.

We run 4 nodes Proxmox Ceph cluster on OVH. The internal network for the cluster is built on OVH vRack with a bandwidth 4Gbps. Within the cluster, we use CephFS as storage for shared data that webservers (2 VM) are using, such as web data (scripts), images, vhosts, and some others. Those data are connected over NFS to webservers' VMs. In addition, there are 3 MariaDB VMs with Galera3 multi-master replication. There are some more VM like Redis which are not important in this case. All VMs have their disks/images in Ceph.

Since the cluster was built, we have used Ceph Nautilus, without absolutely no problems and during the main season, traffic was at least double that now.

Two weeks ago I have updated Ceph from Nautilus to Octopus and MariaDB VMs to Deb Bullseye. Everything seems working well, except that a few times per day suddenly a huge number of kworker/sda processes appeared on the MariaDB VMs (like 2000 processes like that) but everything worked fine and after a few minutes, they went gone. I did not find any reason for that. Nothing in logs, no slow iops, graphs looked fine. So, I kept it as it was.

A few days ago (last Thursday) Ceph started to report slow ops and logged errors like this:

Code:

Jun 18 09:07:14 node3 ceph-osd[2030776]: 2022-06-18T09:07:14.840+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:15 node3 ceph-osd[2030776]: 2022-06-18T09:07:15.876+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:16 node3 ceph-osd[2030776]: 2022-06-18T09:07:16.920+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:17 node3 ceph-osd[2030776]: 2022-06-18T09:07:17.876+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:18 node3 ceph-osd[2030776]: 2022-06-18T09:07:18.868+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
OR
Jun 20 13:30:45 node1 ceph-osd[2017144]: 2022-06-20T13:30:45.131+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)
Jun 20 13:30:46 node1 ceph-osd[2017144]: 2022-06-20T13:30:46.099+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)
Jun 20 13:30:47 node1 ceph-osd[2017144]: 2022-06-20T13:30:47.083+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)

The errors appeared on all nodes within all OSDs. MariaDB servers have high I/O waits (90% and more), NFS on webservers started to respond slowly, I/O waits on webservers were high too. My first impression was network problems. So, I made a double-check network communication across all servers and everything was good. No packet losses and ping were good. I have contacted OVH support if there is something wrong possibly on their side but they confirm they are good. So, I rebooted all nodes, I tried to replace vrack service (put all servers out, create a new vrack in ovh and add them back to a newly created one). I double-checked all disks health. Nothing has helped to solve the problem and it affected our production services very badly.

So, I had to put one disk out of the Ceph as a single/local/standalone disk on one pve server and in the first stage I migrated one of the MariaDB servers to this local disk and stopped Galera replication, so only one mysql stayed working. The situation was a little bit better regarding the affections of our services but not ideal - webserver still froze with high I/O and ceph still logged slow ops, of course. Therefore I moved one of the webservers to the same local disk as mariadb server, out of the ceph. I also moved all scripts and webapp data (except for static files like vhosts and images) from ceph to local disk within webserver and in the load-balancing, I have disabled the second webserver. The current state is we run only with one VM as webserver and one VM as mysql/mariadb without replication (without even RAID!). In this stage, the situation returned to normal and our services worked as before and are stable. Ceph was not logging any other slow ops messages. Except for one situation, which is mysql backup. When mysql backup is executed, by using mariabackup stream backup, slow iops and ceph slow ops errors are back. Backup is provided to the cephfs connected to the mysql/mariadb VM.

I have tried to do some I/O stress tests by fio utility. I have tested:
- from PVE machines to the mounted cephfs (/mnt/pve/cephfs),
- from Mysql VM to mounted cephfs
- from webserver to mounted cephfs over NFS
- from the second webserver (which is located in Ceph not on the local disk as currently working webserver) to root (/) and also to cephfs mounted over NFS

Every FIO test was successful, no ceph slow ops, not high I/Os. No network speed and disk speed issues, no errors. In the other words, I am not able to reproduce the situation when Ceph is reporting slow ops and everything goes to....

Yesterday I upgraded all physical machines to the Debian Bullseye and Ceph to Pacific. So now all is updated to the very latest version available. But the slow ops appeared again (as I described before, now it is happening where mysql is backuping up).

I tried to google the error and possible symptoms but I did not found anything helpful to solve this issue. Our production infrastructure is now running in very limited, a temporary state with only one web server and one mysql/mariadb server, not replicated, disk not mirrored, no H/A, no load-balancing. Everything has had to been disabled.

I have found something interested in "ceph osd dump" which is "blocklist". I have no idea if this might be related somehow and if so how and how to solve it. I even do not know if something like that was there before the problems occurs.

Do you someone has any idea what should be possible wrong, how to diagnose/debug and solve the problem, please? I am truly out of ideas and I need to solve it ASAP somehow. We cannot sustain this state. Thank you very much!

Here are some details about ceph and what Fio tests I did.

ceph -s

Code:

  cluster:
    id:     f629bf29-4936-4a79-9b66-fe188b93cb0e
    health: HEALTH_WARN
            nodeep-scrub flag(s) set
            1 pgs not deep-scrubbed in time
 
  services:
    mon: 4 daemons, quorum node1,node2,node3,node4 (age 14h)
    mgr: node1(active, since 14h), standbys: node4, node3, node2
    mds: 1/1 daemons up, 3 standby
    osd: 10 osds: 10 up (since 14h), 10 in (since 3d)
         flags nodeep-scrub
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 217 pgs
    objects: 29.64M objects, 2.2 TiB
    usage:   6.6 TiB used, 8.9 TiB / 16 TiB avail
    pgs:     217 active+clean
 
  io:
    client:   557 KiB/s rd, 200 KiB/s wr, 5 op/s rd, 38 op/s wr

pveversion: pve-manager/7.2-4/ca9d43cc (running kernel: 5.15.35-2-pve)

ceph osd dump

Code:

epoch 4296
fsid f629bf29-4936-4a79-9b66-fe188b93cb0e
created 2020-10-10T09:26:10.068683+0000
modified 2022-06-19T22:44:16.119497+0000
flags nodeep-scrub,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 82
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'vmdata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3453 lfor 0/0/2703 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 3251 lfor 0/0/159 flags hashpspool stripe_width 0 application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode warn last_change 3251 lfor 0/0/2782 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'device_health_metrics' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 4255 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 6 '.rgw.root' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4095 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.log' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4097 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.control' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4099 flags hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.meta' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 4218 lfor 0/4218/4216 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
max_osd 11
osd.0 up   in  weight 1 up_from 4283 up_thru 4283 down_at 4279 last_clean_interval [4062,4278) [v2:172.16.0.10:6810/2017134,v1:172.16.0.10:6811/2017134] [v2:172.16.0.10:6814/2017134,v1:172.16.0.10:6816/2017134] exists,up 3866b7ef-7679-4fe0-97a5-e5eceaf2b3ad
osd.1 up   in  weight 1 up_from 4285 up_thru 4285 down_at 4280 last_clean_interval [4071,4278) [v2:172.16.0.10:6802/2017144,v1:172.16.0.10:6803/2017144] [v2:172.16.0.10:6804/2017144,v1:172.16.0.10:6805/2017144] exists,up 6ee7cabd-2359-40b5-8cfa-3b5b4357efd5
osd.2 up   in  weight 1 up_from 4288 up_thru 4288 down_at 4279 last_clean_interval [4065,4278) [v2:172.16.0.10:6812/2017118,v1:172.16.0.10:6813/2017118] [v2:172.16.0.10:6815/2017118,v1:172.16.0.10:6817/2017118] exists,up 427425b5-0e59-4793-94e1-6052ad129c2c
osd.3 up   in  weight 1 up_from 4274 up_thru 4288 down_at 4271 last_clean_interval [4249,4270) [v2:172.16.0.20:6805/20336,v1:172.16.0.20:6811/20336] [v2:172.16.0.20:6812/20336,v1:172.16.0.20:6813/20336] exists,up b7b11701-96ec-4251-a306-55c3ac534127
osd.5 up   in  weight 1 up_from 4277 up_thru 4288 down_at 4271 last_clean_interval [4250,4270) [v2:172.16.0.20:6802/20331,v1:172.16.0.20:6803/20331] [v2:172.16.0.20:6804/20331,v1:172.16.0.20:6806/20331] exists,up 3fae6f79-c828-4e0a-90cb-603344ce054f
osd.6 up   in  weight 1 up_from 4266 up_thru 4288 down_at 4263 last_clean_interval [4239,4262) [v2:172.16.0.30:6808/37937,v1:172.16.0.30:6809/37937] [v2:172.16.0.30:6810/37937,v1:172.16.0.30:6811/37937] exists,up 89eca724-3f3b-4ac1-bc3a-8a85ebc534fe
osd.7 up   in  weight 1 up_from 4269 up_thru 4288 down_at 4263 last_clean_interval [4242,4262) [v2:172.16.0.30:6800/37941,v1:172.16.0.30:6801/37941] [v2:172.16.0.30:6802/37941,v1:172.16.0.30:6803/37941] exists,up 3114db5c-4bba-4eb8-ad97-7aa60bb6951d
osd.8 up   in  weight 1 up_from 4261 up_thru 4288 down_at 4254 last_clean_interval [4084,4253) [v2:172.16.0.40:6818/1598145,v1:172.16.0.40:6819/1598145] [v2:172.16.0.40:6820/1598145,v1:172.16.0.40:6821/1598145] exists,up de2d40e3-091b-4e85-9c42-9014b5b125fb
osd.9 up   in  weight 1 up_from 4258 up_thru 4288 down_at 4254 last_clean_interval [4084,4253) [v2:172.16.0.40:6810/1598146,v1:172.16.0.40:6811/1598146] [v2:172.16.0.40:6812/1598146,v1:172.16.0.40:6813/1598146] exists,up bee36776-91f2-4b19-9b5e-cf0ff4a0830c
osd.10 up   in  weight 1 up_from 4261 up_thru 4288 down_at 4255 last_clean_interval [4084,4253) [v2:172.16.0.40:6802/1598147,v1:172.16.0.40:6803/1598147] [v2:172.16.0.40:6804/1598147,v1:172.16.0.40:6805/1598147] exists,up 8296a2a5-2cb3-4417-9cd8-33bee208f87d
blocklist 172.16.0.10:6801/2433354298 expires 2022-06-20T21:56:52.598285+0000
blocklist 172.16.0.10:6800/2433354298 expires 2022-06-20T21:56:52.598285+0000
blocklist 172.16.0.10:6826/1495 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.10:0/3744390068 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.10:0/2995033226 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.20:0/3269462353 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.20:6819/1629 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:0/1913297270 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.30:0/714088057 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.20:0/22525817 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:6818/1943 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.10:6827/1495 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.30:0/3582196443 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.20:6818/1629 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:6819/1943 expires 2022-06-20T21:28:22.648635+0000

Fio I have used:

Code:

fio --runtime=300 --time_based --name=random-read --rw=randread --size=128m --directory=DIR
fio --runtime=300 --time_based --name=random-read --rw=randread --size=4k --directory=DIR
fio --name=test-1 --numjobs=1 --rw=randrw --rwmixread=40 --bs=4k --iodepth=32 --size=4k --fsync=32 --runtime=600 --time_based --group_reporting --directory=DIR
fio --name=test-1 --numjobs=1 --rw=randrw --rwmixread=40 --bssplit=64k/47:4k/22:16k/12:8k/6:512/5:32k/4:12k/3:256k/1,8k/89:4k/11 --iodepth=32 --fsync=32 --runtime=600 --time_based --group_reporting --directory=DIR

All drives are NVMe, 3.84TB.

If anything more is needed to post here, let me know.

THANK YOU

P.S. Forget to mention, that after moving Mysql/Mariadb VM from ceph to local disk, problem with kworker processes is gone. But I don't know if it is related somehow with ceph or with galera.

P.S. Ceph also reporting some PGs active+clean+laggy or:

Code:

mds.node1(mds.0): XY slow metadata IOs are blocked > 30 secs, oldest blocked for 31 secs
mds.node1(mds.0): XY slow requests are blocked > 30 secs
XY slow ops, oldest one blocked for 37 sec, osd.X has slow ops

fitbrian · Jun 20, 2022

I have now started second mysq/mariadb which is still in Ceph (RBD) without mariadb service running (lets say mysql2). Just booted OS without any services or traffic. Then I run mariabackup on the only one production mysql VM which has it's data on local disk only dumping DB to the cephfs.

I/O waits are 55% and more on the mysql2 VM (empty one) . Nothing is running in this VM.

When db backup is completed, I/O waits go down on the mysql2. So, there is evidently something wrong but I'm not able to debug if it is network related (and how to prove it) or anything else.

dpl · Jun 22, 2022

We have had a similar problem, we could not solve it: https://forum.proxmox.com/threads/extremely-slow-ceph-storage-from-over-60-usage.101051/

Therefore, a few comments from my side:

1. With 4 nodes you should not have a monitor or standby MDS active on each node, if one node fails, the inaccessibility of the Ceph daemons directly affects the Ceph cluster.
= My recommendation is max. 3 Ceph Monitors, MGRs and MDS
2. I do not recommend CephFS, it is very resource hungry and latency dependent, especially with many created objects.
(See my problem description in the link above).
3. You have 8 pools?
= I guess that's too many in total compared to the maximum storage in your Ceph cluster, this gives problems with autoscaling of the placement groups

In summary my recommendation:

1. maximum 3 Ceph monitors
2. do not use CephFS but only RBD
3. maximum 3 pools (1x device_health_metrics and 2x rbd data pools)

fitbrian · Jun 23, 2022

Hi, thank you for your reply.

1) understand, I try to change it to the 3 monitors, MGRs and MDS
2) eventually we can replace cephfs by RBD only, but I guess, RBD cannot be used the way like cephfs is, right? I mean, as a mounted filesystem and distributed across multiple VMs by NFS for example. We use cephfs to store shared data for multiple VMs.
3) originally there was just 3 pools but I entered some RBD command in the console (I think rbd df it was) and another pool was created automatically. I don't know if it's safe to remove them.

dpl · Jun 23, 2022

2. CephFS is very resource hungry due to the metadata handling of the MDS, the setup should be well tuned, this requires good planning with division into HDD, SSD and NVME device classes and offloading of the Wal+DB.

Yes CephFS has the advantage that you can mount it multiple times directly (via Ceph client in a VM) or "natively" via librbd or krbd in Containers.

However, the parallel use of a mounted CephFS must also be used well thought out with the respective data protocols (SMB / NFS), especially with regard to file locking, caching (Oplocks etc.).

We no longer use CephFS, but only RBD, and distribute the SMB / NFS data connection via a separate VM with (inside) ZFS and Ceph backend storage.

From a high availability perspective, the VM may be a single point of failure, but that is acceptable as this is a very minimalistic Linux environment with just mentioned network protocol services.

3. the total storage space of the Ceph cluster and the increasing storage requirements will determine how many placement groups are needed dynamically.
In my opinion you should not exceed the "Safe Cluster Size" in the Ceph Calculator https://florian.ca/ceph-calculator/.
For example, if you have assigned the "Safe Cluster Size" to an RBD volume and fill it to the limit, and more OSDs fly out than what the pool can provide for the volume, you will quickly get problems with "backfillful osd(s)" and "Low space hindering backfill (add storage if this doesn't...)" Errors.

spirit · Jun 23, 2022

We no longer use CephFS, but only RBD, and distribute the SMB / NFS data connection via a separate VM with (inside) ZFS and Ceph backend storage.

From a high availability perspective, the VM may be a single point of failure, but that is acceptable as this is a very minimalistic Linux environment with just mentioned network protocol services.

I'm doing the same here.
Too much problem with cephfs (mostly with millions of small files), with finally use an HA vm with nfs (on rbd). No more problem.

fitbrian · Jun 23, 2022

OK, understand. As I mentioned, replacing Cephfs should be an option and I am open about it.

dpl said:
We no longer use CephFS, but only RBD, and distribute the SMB / NFS data connection via a separate VM with (inside) ZFS and Ceph backend storage.

From a high availability perspective, the VM may be a single point of failure, but that is acceptable as this is a very minimalistic Linux environment with just mentioned network protocol services.

spirit said:
Too much problem with cephfs (mostly with millions of small files), with finally use an HA vm with nfs (on rbd). No more problem.

A single point of failure is not acceptable to us. I can go to "VM as storage with NFS distribution" solution but I need to handle HA. What are your recommendations here regarding data replication? I can make HA by using Corosync/Pacemaker but with such a huge amount of data (currently 1.5TB and it will grow), what should be the best solution for replicated storage? DRBD on the top of the LVM, ZFS (I do not have many experiences with), GlusterFS (no experiences at all). What is your solution for your setup @spirit?

Thank you, guys.

dpl · Jun 23, 2022

You don't need to replicate the "payload" data of the VM itself, you have the shared Ceph RBD storage for that.
You only need to run the VM in the Proxmox cluster in HA mode.

(All other data cluster solutions like DRBD or GlusterFS are unnecessary)

fitbrian · Jun 23, 2022

dpl said:
You only need to run the VM in the Proxmox cluster in HA mode.

Oh, ok, understand. I guess you mean this setup https://pve.proxmox.com/wiki/High_Availability, right? I have not experience with HA VMs within the Proxmox cluster. I have to check it out in detail.

Thank you

hthpr · Jun 23, 2022

I had the same issue on our cluster with ceph suddenly showing slow ops for the nvme drives. ceph was already on Pacific. Nothing hardware wise changed on the servers and it affected random drives. But the regular SSDs had no issues. I checked the apt logs and for me the issues started once Proxmox switched to kernel 5.15. The slow ops suddenly appeared after the installation and activation of that new kernel. After a few minutes the errors usually went away again and everything worked again without any issues until ceph or something else decided it's time for slow ops again.

I don't know if it will fix your problems, but I haven't had the time to look more into the issue. I haven't been able to reproduce the slow ops on demand. I switched back to the previous kernel 5.13 when everything was stable and I haven't had any issues with slow ops anymore. Everything is running rock solid.

proxmox-boot-tool kernel pin 5.13.19-6-pve

forces proxmox to use that kernel.

I know it's not ideal to use an older kernel which doesn't get security fixes, but it's better than having your ceph cluster break at random times.

fitbrian · Jun 23, 2022

hthpr said:
I had the same issue on our cluster with ceph suddenly showing slow ops for the nvme drives. ceph was already on Pacific. Nothing hardware wise changed on the servers and it affected random drives. But the regular SSDs had no issues. I checked the apt logs and for me the issues started once Proxmox switched to kernel 5.15. The slow ops suddenly appeared after the installation and activation of that new kernel. After a few minutes the errors usually went away again and everything worked again without any issues until ceph or something else decided it's time for slow ops again.

I don't know if it will fix your problems, but I haven't had the time to look more into the issue. I haven't been able to reproduce the slow ops on demand. I switched back to the previous kernel 5.13 when everything was stable and I haven't had any issues with slow ops anymore. Everything is running rock solid.

proxmox-boot-tool kernel pin 5.13.19-6-pve

forces proxmox to use that kernel.

I know it's not ideal to use an older kernel which doesn't get security fixes, but it's better than having your ceph cluster break at random times.

that is very interesting, indeed. thanks, I will definitely test switch to an older kernel. Thanks!

acidrop · Jun 23, 2022

Moreover some links to Ceph docs which seem to relate to the issues you are seeing ...

https://docs.ceph.com/en/latest/cephfs/troubleshooting/

https://docs.ceph.com/en/latest/rad...d/?highlight=backfill#debugging-slow-requests

https://docs.ceph.com/en/latest/cephfs/eviction/

hthpr · Jun 24, 2022

fitbrian said:
that is very interesting, indeed. thanks, I will definitely test switch to an older kernel. Thanks!

Let me know if your system becomes stable again after the kernel change. Then we are onto something. BTW: I'm not using cephfs or have it enabled.

dpl · Jun 24, 2022

@hthpr
I read now often in the forum that there are problems with the 5.15 kernel (should also be problems with the pci / gpu passthrough).

The changelog of 5.15.49 is long, https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.15.49
if necessary it is worth to search for "scheduler" fixes.

The Proxmox repository currently offers only up to version pve-kernel-5.15.35-3-pve.

dpl · Jun 24, 2022

Base Changelog https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.15

fitbrian · Jun 26, 2022

I've made now a test with VM as storage (all RBD) and storage dir exported with NFS. I mounted this NFS export to the running production mysql server, change backup script to export backups to this storage VM (NFS export) and the problem is gone. No slow ops, no high I/Os, and everything seems to be running well. It's just the first test I've made now, I will do some more testing yet (switching some data from the cephfs mountpoint to this storage VM also on webserver, etc.), but so far so good. So, it seems like the problem is with cephfs indeed. Either regarding the newer kernel version (I had no space to test the older kernel yet) or with cephfs itself, for example, some bug in Octopus and Pacific since with Nautilus, everything worked well without any issues.

YAGA · Jul 10, 2022

Hi,

I've the same issues with CephFS (slow requests - slow ops, oldest one blocked for xxx sec - freezes...) but Ceph RBD is working properly.

CephFS is unusable for the backups. Now backups needs hours or even days.

This problem suddenly appeared several months ago with Ceph Pacific. No issue with Ceph Octopus, the previous version.

Now I've upgraded Ceph Pacific to Ceph Quincy, same result Ceph RDB is ok but CephFS is definitely too slow with warnings : slow requests - slow ops, oldest one blocked for xxx sec...

Here is my setup :
- Cluster with 4 nodes
- 3 osd (hdd) per node i.e. 12 osd for the cluster.
- Dedicated 10 Gbit/s network for Ceph (iperf is ok 9.5 GB/s)
- Ceph RDB performance is ok : rados bench -> Bandwidth (MB/sec): 109.16
- Ceph Quincy

I have no clue to solve the problem.

Any advice welcome.

dpl · Jul 10, 2022

@YAGA

- add SSDs / NVMEs to the nodes
- create a "replicated_rule" based on "device-class" and move the "cephfs_metadata" pool to the SSDs/NVMEs

Maybe this will speed up your CephFS "a bit".

YAGA · Jul 11, 2022

dpl said:
@YAGA

- add SSDs / NVMEs to the nodes
- create a "replicated_rule" based on "device-class" and move the "cephfs_metadata" pool to the SSDs/NVMEs

Maybe this will speed up your CephFS "a bit".

Hi @dpl

Thanks for your message, it's a good point. I'll try to do that.

But I would like to understand why my system which worked perfectly before with ceph FS has become so slow that it becomes unusable.

This is probably due to either a Proxmox PVE update or a Proxmox Ceph update.

gyverchang · Jan 14, 2023

Hey guys, I am also getting this issue, however, I am not using CephFS and this is based on a new cluster setup with nvme. I posted a question on the forums as well: https://forum.proxmox.com/threads/ceph-slow-ops.121033/

Just to summarize: new set up, added OSDs, create pools and I get slow ops all the over the place and no good pgs at all.

Tried downgrading the kernel as suggested but to no avail.

Ceph is unusable on my 3 node instance so I switched back to Zfs for now.

Not sure if any of you guys have a solution for this issue?

Thanks!

Ceph: sudden slow ops, freezes, and slow-downs

Member

Member

Active Member

Member

Active Member

Distinguished Member

Member

Active Member

Member

Active Member

Member

Renowned Member

Active Member

Active Member

Active Member

Member

Renowned Member

Active Member

Renowned Member

New Member

Attachments

We value your privacy