Hello,
I would like to ask for your help because I am running out of ideas on how to solve our issue.
We run a 4-node Proxmox Ceph cluster on OVH. The internal cluster network is built on an OVH vRack with 4 Gbps of bandwidth. Within the cluster we use CephFS as storage for shared data used by the web servers (2 VMs), such as web data (scripts), images, vhosts, and a few other things. That data is exported over NFS to the web server VMs. In addition, there are 3 MariaDB VMs with Galera 3 multi-master replication. There are a few more VMs, such as Redis, that are not relevant here. All VMs have their disks/images in Ceph.
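For clarity, the data path for the shared web data looks roughly like this; the mount points, export options, and addresses below are only illustrative, not our exact configuration:
Code:
# CephFS mounted with the kernel client (Proxmox mounts it at /mnt/pve/cephfs)
mount -t ceph 172.16.0.10:6789:/ /mnt/pve/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# /etc/exports on the machine that re-exports the share (subnet illustrative)
/mnt/pve/cephfs/webdata 10.0.0.0/24(rw,sync,no_subtree_check)

# On each web server VM
mount -t nfs nfs-gateway:/mnt/pve/cephfs/webdata /var/www/shared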
Since the cluster was built we have used Ceph Nautilus with absolutely no problems, and during the main season traffic was at least double what it is now.
Two weeks ago I updated Ceph from Nautilus to Octopus and the MariaDB VMs to Debian Bullseye. Everything seemed to be working well, except that a few times per day a huge number of kworker/sda processes (around 2,000 of them) suddenly appeared on the MariaDB VMs; everything kept working, and after a few minutes they were gone. I did not find any reason for it: nothing in the logs, no slow IOPS, and the graphs looked fine. So I left it as it was.
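When these kworker storms happened I only counted the threads and watched disk utilisation, roughly like this (an illustrative one-liner, not a proper diagnosis):
Code:
# Count kworker kernel threads on the MariaDB VM (normally there are only a few dozen)
ps -eo comm | grep -c '^kworker'
# Watch device utilisation and I/O wait at the same time
iostat -x 2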
A few days ago (last Thursday) Ceph started to report slow ops and logged errors like this:
Code:
Jun 18 09:07:14 node3 ceph-osd[2030776]: 2022-06-18T09:07:14.840+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:15 node3 ceph-osd[2030776]: 2022-06-18T09:07:15.876+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:16 node3 ceph-osd[2030776]: 2022-06-18T09:07:16.920+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:17 node3 ceph-osd[2030776]: 2022-06-18T09:07:17.876+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
Jun 18 09:07:18 node3 ceph-osd[2030776]: 2022-06-18T09:07:18.868+0000 7f6c048b2700 -1 osd.7 4054 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.238396093.0:43452 2.39 2:9ee102a9:::10007025b83.0000002f:head [write 0~4194304 in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4054)
OR
Jun 20 13:30:45 node1 ceph-osd[2017144]: 2022-06-20T13:30:45.131+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)
Jun 20 13:30:46 node1 ceph-osd[2017144]: 2022-06-20T13:30:46.099+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)
Jun 20 13:30:47 node1 ceph-osd[2017144]: 2022-06-20T13:30:47.083+0000 7f425199d700 -1 osd.1 4296 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.240584289.0:3587 2.a 2:52f56778:::10007046abf.00000035:head [write 0~4194304 [2@0] in=4194304b] snapc 1=[] ondisk+write+known_if_redirected e4296)
The errors appeared on all nodes and across all OSDs. The MariaDB servers had high I/O waits (90% and more), NFS on the web servers started to respond slowly, and I/O waits on the web servers were high too. My first impression was a network problem, so I double-checked network communication across all servers and everything was good: no packet loss, and ping times were fine. I contacted OVH support to ask whether something might be wrong on their side, but they confirmed everything was fine there. So I rebooted all nodes, tried to replace the vRack service (took all servers out, created a new vRack in OVH, and added them back to the newly created one), and double-checked the health of all disks. Nothing helped, and the problem affected our production services very badly.
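For reference, the network checks between the nodes over the vRack were along these lines (addresses illustrative; iperf3 shown as an example of the bandwidth test):
Code:
# Latency / packet loss between cluster nodes
ping -c 500 -i 0.2 172.16.0.20
# Raw throughput over the vRack (iperf3 server running on the target node)
iperf3 -c 172.16.0.20 -t 60 -P 4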
So I had to take one disk out of Ceph and use it as a single/local/standalone disk on one PVE server. In the first stage I migrated one of the MariaDB servers to this local disk and stopped Galera replication, so only one MySQL instance kept running. The situation got a little better in terms of the impact on our services, but it was not ideal: the web server still froze with high I/O, and Ceph still logged slow ops, of course. Therefore I moved one of the web servers to the same local disk as the MariaDB server, out of Ceph. I also moved all scripts and web app data (except for static files like vhosts and images) from Ceph to the local disk inside that web server, and I disabled the second web server in the load balancer. The current state is that we run with only one web server VM and one MySQL/MariaDB VM without replication (without even RAID!). In this state the situation returned to normal, our services work as before and are stable, and Ceph stopped logging slow ops messages, except in one situation: the MySQL backup. When the MySQL backup runs, using a mariabackup stream backup, the slow IOPS and Ceph slow ops errors come back. The backup is written to the CephFS mount attached to the MySQL/MariaDB VM.
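The backup job that triggers it is an ordinary mariabackup stream written straight onto the CephFS mount, something along these lines (user, paths, and compression are illustrative, not our exact script):
Code:
# Full streamed backup onto the CephFS mount inside the MariaDB VM
mariabackup --backup --user=backupuser --password=... --stream=xbstream \
  | gzip > /mnt/cephfs/backups/mariadb-$(date +%F).xb.gz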
I have tried some I/O stress tests with the fio utility. I tested:
- from the PVE machines to the mounted CephFS (/mnt/pve/cephfs),
- from the MySQL VM to the mounted CephFS,
- from the web server to the CephFS mount over NFS,
- from the second web server (which still lives in Ceph, unlike the currently working web server on the local disk) to its root (/) and also to CephFS mounted over NFS.
Every fio test passed: no Ceph slow ops, no high I/O waits, no network or disk speed issues, no errors. In other words, I am not able to reproduce the situation where Ceph reports slow ops and everything goes downhill.
Yesterday I upgraded all physical machines to Debian Bullseye and Ceph to Pacific, so everything is now on the latest available versions. But the slow ops appeared again (as described above, it now happens while MySQL is backing up).
I tried to Google the error and the possible symptoms but did not find anything that helped solve this issue. Our production infrastructure is now running in a very limited, temporary state with only one web server and one MySQL/MariaDB server: not replicated, disks not mirrored, no HA, no load balancing. Everything had to be disabled.
I also found something interesting in the "ceph osd dump" output: the "blocklist" entries. I have no idea whether this might be related, and if so how, or how to resolve it. I don't even know whether something like that was there before the problems occurred.
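For reference, the blocklist can also be listed directly and, as far as I understand, entries can be removed manually (I have not tried that; the address below is just one example from the dump):
Code:
# List the current blocklist entries (same data that shows up in "ceph osd dump")
ceph osd blocklist ls
# Remove a single entry manually if it turns out to be stale
ceph osd blocklist rm 172.16.0.10:0/3744390068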
Does anyone have an idea what could be wrong here, and how to diagnose/debug and solve the problem, please? I am truly out of ideas and need to solve it ASAP; we cannot sustain this state. Thank you very much!
Here are some details about Ceph and the fio tests I ran.
ceph -s
Code:
  cluster:
    id:     f629bf29-4936-4a79-9b66-fe188b93cb0e
    health: HEALTH_WARN
            nodeep-scrub flag(s) set
            1 pgs not deep-scrubbed in time

  services:
    mon: 4 daemons, quorum node1,node2,node3,node4 (age 14h)
    mgr: node1(active, since 14h), standbys: node4, node3, node2
    mds: 1/1 daemons up, 3 standby
    osd: 10 osds: 10 up (since 14h), 10 in (since 3d)
         flags nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 217 pgs
    objects: 29.64M objects, 2.2 TiB
    usage:   6.6 TiB used, 8.9 TiB / 16 TiB avail
    pgs:     217 active+clean

  io:
    client: 557 KiB/s rd, 200 KiB/s wr, 5 op/s rd, 38 op/s wr
pveversion: pve-manager/7.2-4/ca9d43cc (running kernel: 5.15.35-2-pve)
ceph osd dump
Code:
epoch 4296
fsid f629bf29-4936-4a79-9b66-fe188b93cb0e
created 2020-10-10T09:26:10.068683+0000
modified 2022-06-19T22:44:16.119497+0000
flags nodeep-scrub,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 82
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'vmdata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3453 lfor 0/0/2703 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 3251 lfor 0/0/159 flags hashpspool stripe_width 0 application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode warn last_change 3251 lfor 0/0/2782 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'device_health_metrics' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 4255 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 6 '.rgw.root' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4095 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.log' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4097 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.control' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4099 flags hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.meta' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 4218 lfor 0/4218/4216 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
max_osd 11
osd.0 up in weight 1 up_from 4283 up_thru 4283 down_at 4279 last_clean_interval [4062,4278) [v2:172.16.0.10:6810/2017134,v1:172.16.0.10:6811/2017134] [v2:172.16.0.10:6814/2017134,v1:172.16.0.10:6816/2017134] exists,up 3866b7ef-7679-4fe0-97a5-e5eceaf2b3ad
osd.1 up in weight 1 up_from 4285 up_thru 4285 down_at 4280 last_clean_interval [4071,4278) [v2:172.16.0.10:6802/2017144,v1:172.16.0.10:6803/2017144] [v2:172.16.0.10:6804/2017144,v1:172.16.0.10:6805/2017144] exists,up 6ee7cabd-2359-40b5-8cfa-3b5b4357efd5
osd.2 up in weight 1 up_from 4288 up_thru 4288 down_at 4279 last_clean_interval [4065,4278) [v2:172.16.0.10:6812/2017118,v1:172.16.0.10:6813/2017118] [v2:172.16.0.10:6815/2017118,v1:172.16.0.10:6817/2017118] exists,up 427425b5-0e59-4793-94e1-6052ad129c2c
osd.3 up in weight 1 up_from 4274 up_thru 4288 down_at 4271 last_clean_interval [4249,4270) [v2:172.16.0.20:6805/20336,v1:172.16.0.20:6811/20336] [v2:172.16.0.20:6812/20336,v1:172.16.0.20:6813/20336] exists,up b7b11701-96ec-4251-a306-55c3ac534127
osd.5 up in weight 1 up_from 4277 up_thru 4288 down_at 4271 last_clean_interval [4250,4270) [v2:172.16.0.20:6802/20331,v1:172.16.0.20:6803/20331] [v2:172.16.0.20:6804/20331,v1:172.16.0.20:6806/20331] exists,up 3fae6f79-c828-4e0a-90cb-603344ce054f
osd.6 up in weight 1 up_from 4266 up_thru 4288 down_at 4263 last_clean_interval [4239,4262) [v2:172.16.0.30:6808/37937,v1:172.16.0.30:6809/37937] [v2:172.16.0.30:6810/37937,v1:172.16.0.30:6811/37937] exists,up 89eca724-3f3b-4ac1-bc3a-8a85ebc534fe
osd.7 up in weight 1 up_from 4269 up_thru 4288 down_at 4263 last_clean_interval [4242,4262) [v2:172.16.0.30:6800/37941,v1:172.16.0.30:6801/37941] [v2:172.16.0.30:6802/37941,v1:172.16.0.30:6803/37941] exists,up 3114db5c-4bba-4eb8-ad97-7aa60bb6951d
osd.8 up in weight 1 up_from 4261 up_thru 4288 down_at 4254 last_clean_interval [4084,4253) [v2:172.16.0.40:6818/1598145,v1:172.16.0.40:6819/1598145] [v2:172.16.0.40:6820/1598145,v1:172.16.0.40:6821/1598145] exists,up de2d40e3-091b-4e85-9c42-9014b5b125fb
osd.9 up in weight 1 up_from 4258 up_thru 4288 down_at 4254 last_clean_interval [4084,4253) [v2:172.16.0.40:6810/1598146,v1:172.16.0.40:6811/1598146] [v2:172.16.0.40:6812/1598146,v1:172.16.0.40:6813/1598146] exists,up bee36776-91f2-4b19-9b5e-cf0ff4a0830c
osd.10 up in weight 1 up_from 4261 up_thru 4288 down_at 4255 last_clean_interval [4084,4253) [v2:172.16.0.40:6802/1598147,v1:172.16.0.40:6803/1598147] [v2:172.16.0.40:6804/1598147,v1:172.16.0.40:6805/1598147] exists,up 8296a2a5-2cb3-4417-9cd8-33bee208f87d
blocklist 172.16.0.10:6801/2433354298 expires 2022-06-20T21:56:52.598285+0000
blocklist 172.16.0.10:6800/2433354298 expires 2022-06-20T21:56:52.598285+0000
blocklist 172.16.0.10:6826/1495 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.10:0/3744390068 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.10:0/2995033226 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.20:0/3269462353 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.20:6819/1629 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:0/1913297270 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.30:0/714088057 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.20:0/22525817 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:6818/1943 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.10:6827/1495 expires 2022-06-20T21:49:52.496258+0000
blocklist 172.16.0.30:0/3582196443 expires 2022-06-20T21:28:22.648635+0000
blocklist 172.16.0.20:6818/1629 expires 2022-06-20T21:39:57.838468+0000
blocklist 172.16.0.30:6819/1943 expires 2022-06-20T21:28:22.648635+0000
The fio commands I used:
Code:
fio --runtime=300 --time_based --name=random-read --rw=randread --size=128m --directory=DIR
fio --runtime=300 --time_based --name=random-read --rw=randread --size=4k --directory=DIR
fio --name=test-1 --numjobs=1 --rw=randrw --rwmixread=40 --bs=4k --iodepth=32 --size=4k --fsync=32 --runtime=600 --time_based --group_reporting --directory=DIR
fio --name=test-1 --numjobs=1 --rw=randrw --rwmixread=40 --bssplit=64k/47:4k/22:16k/12:8k/6:512/5:32k/4:12k/3:256k/1,8k/89:4k/11 --iodepth=32 --fsync=32 --runtime=600 --time_based --group_reporting --directory=DIR
All drives are NVMe, 3.84TB.
If anything more is needed, let me know and I will post it here.
THANK YOU
P.S. I forgot to mention that after moving the MySQL/MariaDB VM from Ceph to the local disk, the problem with the kworker processes is gone. But I don't know whether it was related to Ceph or to Galera.
P.S. Ceph is also reporting some PGs as active+clean+laggy, or messages like:
Code:
mds.node1(mds.0): XY slow metadata IOs are blocked > 30 secs, oldest blocked for 31 secs
mds.node1(mds.0): XY slow requests are blocked > 30 secs
XY slow ops, oldest one blocked for 37 sec, osd.X has slow ops
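If it helps, I can also capture and post the output of these while the slow ops are happening (run on the node that hosts the affected OSD/MDS; osd.7 and mds.node1 are just examples from the logs above):
Code:
ceph health detail
# Ops currently stuck on a given OSD
ceph daemon osd.7 dump_ops_in_flight
# Recently completed slow ops recorded by that OSD
ceph daemon osd.7 dump_historic_ops
# Same for the active MDS
ceph daemon mds.node1 dump_ops_in_flight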