Proxmox on Ceph: After PG_NUM increase high read io on NVMEs - VMs unusable

Rainerle

Renowned Member
Jan 29, 2019
120
35
68
Hi,

the autoscaler increased the number of PGs on our Ceph storage (Hardware like this but 5 nodes).

As soon as the backfill starts the VMs become unusable and we startet killing OSD processes that cause high read io load. So as in this picture we would kill the ceph-osd process working on dm-3on proxmox07 and the two on proxmox10. As soon as we killed the 3 OSDs the Ceph woud go into recovery and the VMs would become usable again.

How to throttle backfilling?

We tried:
Code:
osd_backfill_scan_max           16                                                     mon
osd_backfill_scan_min           4                                                      mon
osd_client_op_priority          63                                                     mon
osd_max_backfills               1                                                      mon
osd_recovery_op_priority        1                                                      mon
osd_recovery_sleep              3.000000                                               mon
...but it does not prevent the NVMe's to go into 100% read io usage.

1612807639259.png
 
Last edited:
As soon as the backfill starts the VMs become unusable and we startet killing OSD processes that cause high read io load.
What's the resource consumption in general (eg. with atop)? Usually reads are very well cached (eg. VM OS level).

the autoscaler increased the number of PGs on our Ceph storage (Hardware like this but 5 nodes).
Have you seen the thread on the autoscaler discussion?
https://forum.proxmox.com/threads/c...think-twice-before-enabling-auto-scale.80105/
 
  • Like
Reactions: Rainerle
So,
the monday was horrible, our customers started to hate us again...

Carefully(!!!) restarting the OSD processes and using the recoveries to keep the backfilling from starting up got us over the day, and late in the evening we found the bluefs_buffered_io setting. So that shifted the read IO problem to a OSD daemon CPU usage problem and then allowed us at around 22:30 to run the backfilling with osd_max_backfills set to 8 and osd_recovery_sleep set to 0 at full speed (around 2GByte/s hammering onto the NVMe's).

We are left now with 6 PGs still backfilling. But look at the time estimate:
1612998340423.png


This was the state on tuesday:
Code:
root@proxmox07:~# ceph pg ls | grep backfill
2.36   305353        0    132892       0 24686921003           0          0 3769 active+remapped+backfilling    4m  13592'5295371   15162:4681211   [5,28,12]p5  [12,8,22]p12 2021-02-08 11:20:54.493984 2021-02-05 22:27:51.390415
2.4a   305673        0    591510       0 24682012325           0          0 3064 active+remapped+backfilling    4m  14960'5271067   15162:4648297 [16,21,28]p16  [16,5,18]p16 2021-02-08 11:36:59.143589 2021-02-05 20:15:07.458819
2.58   306526        0    889209       0 25719764446           0          0 3096 active+remapped+backfilling    4m  14995'5322900   15162:4690127    [9,3,26]p9 [16,18,24]p16 2021-02-08 12:04:12.570220 2021-02-02 04:52:53.265575
2.6a   306477        0    888612       0 24980563219           0          0 3012 active+remapped+backfilling    4m  14995'5272216   15162:5893516  [16,19,7]p16 [12,23,24]p12 2021-02-07 08:11:30.533042 2021-02-05 20:15:07.458819
2.78   307731        0    715476       0 25901673569           0          0 3901 active+remapped+backfilling   24h  13592'5317021   15162:4952053    [3,15,7]p3 [16,18,24]p16 2021-02-08 11:37:48.906766 2021-02-02 04:52:53.265575
2.7a   306884        0    713184       0 24739768170           0          0 3006 active+remapped+backfilling   23h  14995'5309988   15162:6143915  [19,14,1]p19 [12,20,29]p12 2021-02-08 11:20:27.224508 2021-02-03 14:15:15.951027

This was the state this morning:
Code:
root@proxmox07:~# ceph pg ls | grep backfill
2.36   305353        0    111926       0 24686921003           0          0 3769 active+remapped+backfilling    8h  13592'5295371   15172:4712680   [5,28,12]p5  [12,8,22]p12 2021-02-08 11:20:54.493984 2021-02-05 22:27:51.390415
2.4a   305673        0    570544       0 24682012325           0          0 3064 active+remapped+backfilling    8h  14960'5271067   15172:4679766 [16,21,28]p16  [16,5,18]p16 2021-02-08 11:36:59.143589 2021-02-05 20:15:07.458819
2.58   306526        0    857760       0 25719764446           0          0 3096 active+remapped+backfilling    8h  14995'5322900   15172:4732079    [9,3,26]p9 [16,18,24]p16 2021-02-08 12:04:12.570220 2021-02-02 04:52:53.265575
2.6a   306477        0    857160       0 24980563219           0          0 3012 active+remapped+backfilling    8h  14995'5272216   15172:5935472  [16,19,7]p16 [12,23,24]p12 2021-02-07 08:11:30.533042 2021-02-05 20:15:07.458819
2.78   307731        0    684024       0 25901673569           0          0 3901 active+remapped+backfilling   33h  13592'5317021   15172:4994009    [3,15,7]p3 [16,18,24]p16 2021-02-08 11:37:48.906766 2021-02-02 04:52:53.265575
2.7a   306884        0    681735       0 24739768170           0          0 3006 active+remapped+backfilling   32h  14995'5309988   15172:6185867  [19,14,1]p19 [12,20,29]p12 2021-02-08 11:20:27.224508 2021-02-03 14:15:15.951027

And this is the state from a few minutes before:
Code:
root@proxmox07:~# ceph pg ls | grep backfill
2.36   305353        0     74202       0 24686921003           0          0 3769 active+remapped+backfilling   24h  13592'5295371   15187:4769296   [5,28,12]p5  [12,8,22]p12 2021-02-08 11:20:54.493984 2021-02-05 22:27:51.390415
2.4a   305673        0    532816       0 24682012325           0          0 3064 active+remapped+backfilling   24h  14960'5271067   15187:4736388 [16,21,28]p16  [16,5,18]p16 2021-02-08 11:36:59.143589 2021-02-05 20:15:07.458819
2.58   306526        0    801171       0 25719764446           0          0 3096 active+remapped+backfilling   24h  14995'5322900   15187:4807561    [9,3,26]p9 [16,18,24]p16 2021-02-08 12:04:12.570220 2021-02-02 04:52:53.265575
2.6a   306477        0    800577       0 24980563219           0          0 3012 active+remapped+backfilling   24h  14995'5272216   15187:6010946  [16,19,7]p16 [12,23,24]p12 2021-02-07 08:11:30.533042 2021-02-05 20:15:07.458819
2.78   307731        0    627435       0 25901673569           0          0 3901 active+remapped+backfilling    2d  13592'5317021   15187:5069491    [3,15,7]p3 [16,18,24]p16 2021-02-08 11:37:48.906766 2021-02-02 04:52:53.265575
2.7a   306884        0    625149       0 24739768170           0          0 3006 active+remapped+backfilling    2d  14995'5309988   15187:6261345  [19,14,1]p19 [12,20,29]p12 2021-02-08 11:20:27.224508 2021-02-03 14:15:15.951027

The current relevant settings for the backfilling are:
Code:
root@proxmox07:~# ceph config show-with-defaults osd.0 | grep 'osd_max_backfills\|osd_recovery_max_active\|osd_recovery_sleep'
osd_max_backfills                                          3
             mon
osd_recovery_max_active                                    3
             default
osd_recovery_sleep                                         0.000000
             default
osd_recovery_sleep_hdd                                     0.100000
             default
osd_recovery_sleep_hybrid                                  0.025000
             default
osd_recovery_sleep_ssd                                     0.000000
             default
root@proxmox07:~#

The current recovery/backfilling speed is terrible and I do not have any idea where else to look at...
 
Adjusted settings after reading this.

Code:
root@proxmox07:~# ceph config show-with-defaults osd.0 | grep 'osd_max_backfills\|osd_recovery_max_active\|osd_recovery_sleep'
osd_max_backfills                                          3
             mon
osd_recovery_max_active                                    3
             default
osd_recovery_sleep                                         0.000000
             default
osd_recovery_sleep_hdd                                     0.000000
             mon
osd_recovery_sleep_hybrid                                  0.000000
             mon
osd_recovery_sleep_ssd                                     0.000000
             default
root@proxmox07:~#

Does not seem to help the backfilling...

Code:
root@proxmox07:~# ceph status

  cluster:

    id:     80c20b8a-00f0-4e3a-954f-e6940f603ba3

    health: HEALTH_OK


  services:

    mon: 5 daemons, quorum proxmox07,proxmox08,proxmox09,proxmox10,proxmox11 (age 4d)

    mgr: proxmox07(active, since 2d), standbys: proxmox10, proxmox08, proxmox09, proxmox11

    mds: cephfs:1 {0=proxmox07=up:active} 4 up:standby

    osd: 30 osds: 30 up (since 2d), 30 in (since 2M); 6 remapped pgs


  task status:

    scrub status:

        mds.proxmox07: idle


  data:

    pools:   3 pools, 2184 pgs

    objects: 42.46M objects, 7.5 TiB

    usage:   20 TiB used, 67 TiB / 87 TiB avail

    pgs:     3449843/127369341 objects misplaced (2.709%)

             2178 active+clean

             6    active+remapped+backfilling


  io:

    client:   1.6 MiB/s rd, 12 MiB/s wr, 247 op/s rd, 520 op/s wr

    recovery: 128 KiB/s, 2 objects/s


root@proxmox07:~#
 
Is the autoscaler still adjusting the PG count? That process could also lead to the long time estimate.

restarting the OSD processes and using the recoveries to keep the backfilling from starting up got us over the day
You can ceph osd set nobackfill and norebalance to temporarily stop those actions.
 
Is the autoscaler still adjusting the PG count? That process could also lead to the long time estimate.
No, the pg_num and pgp_num are now at 128 and I changed the autoscaler to "warn" for these pools.
Code:
root@proxmox07:~# ceph osd pool ls detail
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 15199 lfor 0/14990/15006 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
        removed_snaps [2~2,5~5,21~5,28~ea,113~5,119~1c,136~30,167~25,18d~a,199~30,1cb~30,1fc~30,22d~31,25f~30,290~2f,2c0~30,2f2~32,325~2e,354~13,368~1,36a~1,36c~1,36e~1,370~1,372~1,374~1,376~1,378~1,37a~1,37c~1,37e~1,380~1,382~1,384~1,386~1,388~1,38a~1,38c~1,38e~1,390~1,392~1,394~1,396~1,398~1,39a~1,39c~1,39e~1,3a0~1,3a2~1,3a4~1,3a6~1,3a8~1,3aa~1,3ac~1,3ae~1,3b0~1,3b2~1,3b4~1,3b7~2,3ba~1,3bc~1,3be~1,3c0~1,3c2~1,3c4~1,3c6~1,3c8~1,3ca~1,3cc~1]
        removed_snaps_queue [6f~1,88~1,fd~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~1,2ec~1,2ee~1,2f2~1,2f5~1,2f7~1,2f9~1,2fb~1,2fd~1,2ff~1,301~1,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1,317~1,319~1,31b~1,31d~1,31f~1,321~1,323~1,327~1,329~1,32b~1,32d~1,32f~1,331~1,333~1,335~1,337~1,339~1,33b~1,33d~1,33f~14,354~13,368~1,36a~1,36c~1,36e~1,370~1,372~1,374~1,376~1,378~1,37a~1,37c~1,37e~1,380~1,382~1,384~1,386~1,388~1,38a~1,38c~1,38e~1,390~1,392~1,394~1,396~1,398~1,39a~1,39c~1,39e~1,3a0~1,3a2~1,3a4~1,3a6~1,3a8~1,3aa~1,3ac~1,3ae~1,3b0~1,3b2~1,3b4~1,3b7~2,3ba~1,3bc~1,3be~1,3c0~1,3c2~1,3c4~1,3c6~1,3c8~1,3ca~1,3cc~1]
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 13478 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 4 'ceph-proxmox-VMs' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 13466 lfor 0/0/4000 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~5,7~1,9~21]

root@proxmox07:~#


You can ceph osd set nobackfill and norebalance to temporarily stop those actions.
I set nobackfill, but that did not help. I did not know about norebalance at that point in time. Hopefully I will never have to use that new knowledge ever...
 
Yes, i can check them, but what should I be looking for???
Code:
root@proxmox07:~# ceph pg ls | head -n1
PG    OBJECTS DEGRADED MISPLACED UNFOUND BYTES       OMAP_BYTES* OMAP_KEYS* LOG  STATE                       SINCE VERSION        REPORTED        UP            ACTING        SCRUB_STAMP                DEEP_SCRUB_STAMP
root@proxmox07:~# ceph pg ls | grep backfill
2.36   305353        0     21372       0 24686921003           0          0 3769 active+remapped+backfilling   46h  13592'5295371   15212:4848589   [5,28,12]p5  [12,8,22]p12 2021-02-08 11:20:54.493984 2021-02-05 22:27:51.390415
2.4a   305673        0    479986       0 24682012325           0          0 3064 active+remapped+backfilling   46h  14960'5271067   15212:4815681 [16,21,28]p16  [16,5,18]p16 2021-02-08 11:36:59.143589 2021-02-05 20:15:07.458819
2.58   306526        0    721926       0 25719764446           0          0 3096 active+remapped+backfilling   46h  14995'5322900   15212:4913269    [9,3,26]p9 [16,18,24]p16 2021-02-08 12:04:12.570220 2021-02-02 04:52:53.265575
2.6a   306477        0    721332       0 24980563219           0          0 3012 active+remapped+backfilling   46h  14995'5272216   15212:6116654  [16,19,7]p16 [12,23,24]p12 2021-02-07 08:11:30.533042 2021-02-05 20:15:07.458819
2.78   307731        0    548187       0 25901673569           0          0 3901 active+remapped+backfilling    2d  13592'5317021   15212:5175203    [3,15,7]p3 [16,18,24]p16 2021-02-08 11:37:48.906766 2021-02-02 04:52:53.265575
2.7a   306884        0    545904       0 24739768170           0          0 3006 active+remapped+backfilling    2d  14995'5309988   15212:6367053  [19,14,1]p19 [12,20,29]p12 2021-02-08 11:20:27.224508 2021-02-03 14:15:15.951027
root@proxmox07:~#
So the UP PGs in the list are the remap/backfill destinations and ACTING are the source PGs.

The numbers in the brackets are the OSD IDs.

Looking into an ACTING PG OSD's log does not show anything interessting to me...
Code:
root@proxmox07:~# journalctl -fu  ceph-osd@5
-- Logs begin at Sat 2021-02-06 17:40:53 CET. --
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.396 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413538 (15175'5410474,15189'5413538] local-lis/les=14894/14895 n=294331 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413536 crt=15189'5413538 lcod 15189'5413534 mlcod 15189'5413534 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.404 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413540 (15175'5410474,15189'5413540] local-lis/les=14894/14895 n=294329 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 crt=15189'5413540 lcod 15189'5413538 mlcod 15189'5413538 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.456 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413578 (15176'5410574,15189'5413578] local-lis/les=14894/14895 n=294327 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413576 crt=15189'5413578 lcod 15189'5413574 mlcod 15189'5413574 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.464 7f6969e59700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413586 (15176'5410574,15189'5413586] local-lis/les=14894/14895 n=294325 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413584 crt=15189'5413586 lcod 15189'5413582 mlcod 15189'5413582 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.476 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413592 (15176'5410574,15189'5413592] local-lis/les=14894/14895 n=294323 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 crt=15189'5413592 lcod 15189'5413590 mlcod 15189'5413590 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.504 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413610 (15176'5410574,15189'5413610] local-lis/les=14894/14895 n=294320 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413608 crt=15189'5413610 lcod 15189'5413606 mlcod 15189'5413606 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.552 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413638 (15176'5410574,15189'5413638] local-lis/les=14894/14895 n=294316 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413636 crt=15189'5413638 lcod 15189'5413634 mlcod 15189'5413634 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.560 7f696de61700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413640 (15176'5410574,15189'5413640] local-lis/les=14894/14895 n=294314 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 crt=15189'5413640 lcod 15189'5413638 mlcod 15189'5413638 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.576 7f6969e59700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413654 (15176'5410574,15189'5413654] local-lis/les=14894/14895 n=294312 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 luod=15189'5413652 crt=15189'5413654 lcod 15189'5413650 mlcod 15189'5413650 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
Feb 11 00:00:10 proxmox07 ceph-osd[2874332]: 2021-02-11 00:00:10.664 7f6969e59700 -1 osd.5 pg_epoch: 15189 pg[2.1e( v 15189'5413704 (15176'5410674,15189'5413704] local-lis/les=14894/14895 n=294308 ec=131/131 lis/c 14894/14894 les/c/f 14895/14895/0 14894/14894/14313) [5,13,20] r=0 lpr=14894 crt=15189'5413704 lcod 15189'5413702 mlcod 15189'5413702 active+clean+snaptrim trimq=[fd~1,34f~1,3b7~2] ps=118] removing snap head
 
Last edited:
One of the six remaining PGs finished today... and again two OSDs hat 100% Read IO usage as soon as the PG finished.
Could it be that the PG is deep scrubbed afterwards?

Yes, i can check them, but what should I be looking for???
To see if there is any indication to why those PGs are remapped.
Looking into an ACTING PG OSD's log does not show anything interessting to me...
The logs have usually more messages, then the journal. If nothing useful is there, then you might need to bump up the logging.
 
Deep scrubbing is not happening at all at the moment - and probably not untill this is finished...
Code:
()
9 pgs not deep-scrubbed in time

pg 4.76f not deep-scrubbed since 2021-01-31 13:12:33.368818
pg 4.72f not deep-scrubbed since 2021-01-31 16:48:10.470599
pg 4.649 not deep-scrubbed since 2021-01-31 16:22:52.007030
pg 4.8f not deep-scrubbed since 2021-01-31 13:13:26.779725
pg 4.3cf not deep-scrubbed since 2021-01-31 11:02:19.086343
pg 4.43b not deep-scrubbed since 2021-01-31 14:07:55.351858
pg 4.4e7 not deep-scrubbed since 2021-01-31 16:12:24.098696
pg 4.580 not deep-scrubbed since 2021-01-31 16:37:59.494733
pg 4.5a2 not deep-scrubbed since 2021-01-31 14:54:48.741502
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!