[SOLVED] ceph health ok, but 1 active+clean+scrubbing+deep

lifeboy

Renowned Member
After updating to Ceph 16 (Pacific) on PVE 7, I have the following condition:

Bash:
~# ceph health detail
HEALTH_OK

but

Bash:
~# ceph status
  cluster:
    id:     04385b88-049f-4083-8d5a-6c45a0b7bddb
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum FT1-NodeA,FT1-NodeB,FT1-NodeC (age 13h)
    mgr: FT1-NodeC(active, since 13h), standbys: FT1-NodeA, FT1-NodeB
    mds: 1/1 daemons up, 2 standby
    osd: 24 osds: 24 up (since 13h), 24 in (since 3w)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 337 pgs
    objects: 1.93M objects, 7.1 TiB
    usage:   19 TiB used, 18 TiB / 37 TiB avail
    pgs:     336 active+clean
             1   active+clean+scrubbing+deep
 
  io:
    client:   181 KiB/s rd, 5.0 MiB/s wr, 6 op/s rd, 217 op/s wr

So I can't use the normal ceph repair or similar tools here. What can I do to fix this? It's been like this for more than a day now.
 
Ceph will periodically deep scrub PGs. This is nothing to worry about.
You can check the status of PGs with ceph pg [0], and it should also be logged when it starts deep scrubbing a PG and when it finishes (/var/log/ceph/ceph.log).


[0] https://docs.ceph.com/en/latest/man/8/ceph/
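
As a concrete sketch of that check (assuming a standard Ceph CLI; exact column layout and the wording of the scrub log lines can vary between releases):

Bash:
# list only the PGs whose state includes "scrubbing" (shows the PG id and its acting OSDs)
ceph pg dump pgs_brief | grep scrubbing

# show when deep scrubs start and finish in the cluster log
grep -E 'deep-scrub (starts|ok)' /var/log/ceph/ceph.log

If the same PG id keeps showing up for many hours without a matching "ok" line, the scrub is more likely stuck than just slow.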
 
Hi,

yesterday morning I updated my 5-node cluster from Proxmox 7.1-7 to 7.1-12 following these steps:
Code:
1. Set noout, noscrub and nodeep-scrub before starting the update process;
2. Updated all 5 nodes without problems;
3. Unset the flags noout, noscrub and nodeep-scrub

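For reference, a minimal sketch of how those flags are usually set before the maintenance window and cleared afterwards (standard ceph osd set/unset commands):

Code:
# before the update: prevent rebalancing and pause (deep) scrubbing
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# after all nodes have been updated: clear the flags again
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
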
I have 2 pools, one for NVMe disks and one for SSD disks; the Ceph version is 16.2.7. Ceph health is OK, but the output of ceph status is:
Code:
root@prx-a1-1:~# ceph status
  cluster:
    id:     6cddec54-f21f-4261-b8bd-b475e64bd3e3
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum prx-a1-1,prx-a1-2,prx-a1-3 (age 30h)
    mgr: prx-a1-2(active, since 30h), standbys: prx-a1-1, prx-a1-3
    osd: 26 osds: 26 up (since 30h), 26 in (since 8M)
 
  data:
    pools:   3 pools, 2049 pgs
    objects: 1.04M objects, 3.9 TiB
    usage:   12 TiB used, 70 TiB / 82 TiB avail
    pgs:     2048 active+clean
             1    active+clean+scrubbing+deep
 
  io:
    client:   18 KiB/s rd, 8.7 MiB/s wr, 2 op/s rd, 163 op/s wr

The output of pveversion -v is:

Code:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_ADDRESS = "it_IT.UTF-8",
    LC_NAME = "it_IT.UTF-8",
    LC_MONETARY = "it_IT.UTF-8",
    LC_PAPER = "it_IT.UTF-8",
    LC_IDENTIFICATION = "it_IT.UTF-8",
    LC_TELEPHONE = "it_IT.UTF-8",
    LC_MEASUREMENT = "it_IT.UTF-8",
    LC_TIME = "it_IT.UTF-8",
    LC_NUMERIC = "it_IT.UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").

proxmox-ve: 7.1-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.1-12 (running version: 7.1-12/b3c09de3)
pve-kernel-helper: 7.1-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-7
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

The Ceph log file (/var/log/ceph/ceph.log) is full of:

Code:
2022-04-18T17:00:00.000074+0200 mon.prx-a1-1 (mon.0) 53409 : cluster [INF] overall HEALTH_OK
2022-04-18T17:00:00.786265+0200 mgr.prx-a1-2 (mgr.131256291) 55363 : cluster [DBG] pgmap v55438: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 11 KiB/s rd, 7.1 MiB/s wr, 124 op/s
2022-04-18T17:00:02.787596+0200 mgr.prx-a1-2 (mgr.131256291) 55364 : cluster [DBG] pgmap v55439: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 97 KiB/s rd, 5.0 MiB/s wr, 96 op/s
2022-04-18T17:00:04.791164+0200 mgr.prx-a1-2 (mgr.131256291) 55365 : cluster [DBG] pgmap v55440: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 484 KiB/s rd, 6.9 MiB/s wr, 164 op/s
2022-04-18T17:00:06.794773+0200 mgr.prx-a1-2 (mgr.131256291) 55366 : cluster [DBG] pgmap v55441: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 1.1 MiB/s rd, 11 MiB/s wr, 296 op/s
2022-04-18T17:00:08.796209+0200 mgr.prx-a1-2 (mgr.131256291) 55367 : cluster [DBG] pgmap v55442: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 1.1 MiB/s rd, 8.8 MiB/s wr, 286 op/s
2022-04-18T17:00:10.802137+0200 mgr.prx-a1-2 (mgr.131256291) 55368 : cluster [DBG] pgmap v55443: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 1.1 MiB/s rd, 12 MiB/s wr, 396 op/s
2022-04-18T17:00:12.803731+0200 mgr.prx-a1-2 (mgr.131256291) 55369 : cluster [DBG] pgmap v55444: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 1.1 MiB/s rd, 11 MiB/s wr, 377 op/s
2022-04-18T17:00:14.807279+0200 mgr.prx-a1-2 (mgr.131256291) 55370 : cluster [DBG] pgmap v55445: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 1.0 MiB/s rd, 12 MiB/s wr, 397 op/s
2022-04-18T17:00:16.810957+0200 mgr.prx-a1-2 (mgr.131256291) 55371 : cluster [DBG] pgmap v55446: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 695 KiB/s rd, 13 MiB/s wr, 362 op/s
2022-04-18T17:00:18.812098+0200 mgr.prx-a1-2 (mgr.131256291) 55372 : cluster [DBG] pgmap v55447: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 91 KiB/s rd, 8.2 MiB/s wr, 217 op/s
2022-04-18T17:00:20.817472+0200 mgr.prx-a1-2 (mgr.131256291) 55373 : cluster [DBG] pgmap v55448: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 108 KiB/s rd, 11 MiB/s wr, 272 op/s
2022-04-18T17:00:22.819004+0200 mgr.prx-a1-2 (mgr.131256291) 55374 : cluster [DBG] pgmap v55449: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 54 KiB/s rd, 7.2 MiB/s wr, 164 op/s
2022-04-18T17:00:24.823442+0200 mgr.prx-a1-2 (mgr.131256291) 55375 : cluster [DBG] pgmap v55450: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 53 KiB/s rd, 7.7 MiB/s wr, 180 op/s
2022-04-18T17:00:26.826856+0200 mgr.prx-a1-2 (mgr.131256291) 55376 : cluster [DBG] pgmap v55451: 2049 pgs: 1 active+clean+scrubbing+deep, 2048 active+clean; 3.9 TiB data, 12 TiB used, 70 TiB / 82 TiB avail; 34 KiB/s rd, 8.2 MiB/s wr, 173 op/s

And every 1 or 2 seconds a new line is logged...

It has now been about 30 hours since the update ended. Do I have to worry?

Thank you
 
The problem has not been resolved yet, and /var/log/ceph/ceph.log is still full of the messages shown in my previous post...

Could someone help me, please?

Thank you
 
I'll quote my previous post:

Ceph will periodically deep scrub PGs. This is nothing to worry about.
You can check the status of PGs with ceph pg [0], and it should also be logged when it starts deep scrubbing a PG and when it finishes (/var/log/ceph/ceph.log).


[0] https://docs.ceph.com/en/latest/man/8/ceph/

This is typical for Ceph. It periodically (deep) scrubs its PGs. This is nothing to worry about.
Check with the above command which PG is currently being deep scrubbed.
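
Once the PG id is known, it can also be queried directly; a minimal sketch (2.1f below is only a placeholder PG id):

Code:
# replace 2.1f with the PG id that is reported as active+clean+scrubbing+deep
ceph pg 2.1f query | grep -i scrub

Among other things, this shows the last_scrub_stamp and last_deep_scrub_stamp fields, i.e. when that PG last completed a (deep) scrub.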
 
Hi,

I have used the command ceph pg, but by itself it is incomplete; the output is:
Code:
no valid command found; 10 closest matches:
pg stat
pg getmap
pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
pg dump_json [all|summary|sum|pools|osds|pgs...]
pg dump_pools_json
pg ls-by-pool <poolstr> [<states>...]
pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
pg ls [<pool:int>] [<states>...]
pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
Error EINVAL: invalid command

Thank you
 
Hi,
just as a suggestion:
check the PG and the OSDs involved with "ceph health detail" and "ceph pg dump pgs_brief | egrep <pg_id>", then try restarting one OSD service at a time, checking whether the deep scrub unblocks and continues on the other PGs. The alarms should clear up.
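
A rough sketch of that procedure (the OSD id 12 below is only an example; on Proxmox the OSD daemons run as systemd units named ceph-osd@<id>.service):

Code:
# identify the PG stuck in scrubbing+deep and its acting OSDs
ceph pg dump pgs_brief | grep scrubbing

# restart one OSD daemon at a time, starting with the primary of that PG
systemctl restart ceph-osd@12.service

# then watch whether the deep scrub finishes and moves on to other PGs
ceph status
grep -E 'deep-scrub (starts|ok)' /var/log/ceph/ceph.log | tail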

Best regards
 