[SOLVED] Ceph Reef OSD keeps shutting down

flotho

Renowned Member
Sep 3, 2012
Hi everyone,

I'm working with a 3-node cluster running Ceph 17 and I'm about to upgrade.
I have also added a new node to the cluster and installed Ceph 18.2 on it.
The first OSD I created seems OK, yet after a few moments it is shut down. Here is what I can find in the logs:
Code:
May 18 15:34:44 node4 ceph-osd[18535]: 2024-05-18T15:34:44.690+0000 7344afa006c0 -1 osd.4 34571 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:35:10 node4 ceph-osd[18535]: 2024-05-18T15:35:10.513+0000 7344afa006c0 -1 osd.4 34599 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:35:36 node4 ceph-osd[18535]: 2024-05-18T15:35:36.725+0000 7344afa006c0 -1 osd.4 34606 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:36:04 node4 ceph-osd[18535]: 2024-05-18T15:36:04.815+0000 7344afa006c0 -1 osd.4 34612 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:36:35 node4 ceph-osd[18535]: 2024-05-18T15:36:35.811+0000 7344afa006c0 -1 osd.4 34617 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:37:07 node4 ceph-osd[18535]: 2024-05-18T15:37:07.839+0000 7344afa006c0 -1 osd.4 34621 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.311+0000 7344a5a006c0 -1 osd.4 34625 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 osd.4 34625 *** Got signal Interrupt ***
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 osd.4 34625 *** Immediate shutdown (osd_fast_shutdown=true) ***
May 18 15:37:36 node4 systemd[1]: ceph-osd@4.service: Deactivated successfully.
May 18 15:37:36 node4 systemd[1]: ceph-osd@4.service: Consumed 35.013s CPU time.

Browsing the forum suggests that the message should be harmless (https://forum.proxmox.com/threads/c...-identify-public-interface.58239/#post-268689 and https://github.com/rook/rook/issues/4374), yet as we can see, once the message has been raised 6 times the OSD shuts down.
I've also tried to downgrade to a Ceph 18.1 installation as proposed here: https://forum.proxmox.com/threads/a...2-2-each-osds-never-start.144621/#post-651398
The message is still raised and the OSD is shut down.
So I'm wondering how to avoid this message?
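For reference, this is roughly how I checked the markdown limit and the network settings on the new node (standard config commands, nothing exotic):
Code:
# limit that triggers the shutdown (defaults: 5 markdowns within 600 s)
ceph config get osd osd_max_markdown_count
ceph config get osd osd_max_markdown_period

# check that public_network/cluster_network match an interface on the new node
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
ip -br addr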

Any help / advice would be appreciated.

Regards
 
It's not related.

set_numa_affinity is done once at OSD service start.

It seems that your OSD is restarting in a loop; after 5 restarts it goes into protection mode to avoid an infinite loop and further impact on the cluster.

Do you have logs in /var/log/ceph/ceph-osd.*.log ?
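For example, something like this (OSD id taken from your logs) should show what happens right before each shutdown:
Code:
ls -l /var/log/ceph/
grep -E 'wrongly marked|marked down|Got signal' /var/log/ceph/ceph-osd.4.log | tail -n 50
tail -n 200 /var/log/ceph/ceph-osd.4.log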
 
Thanks for your support @spirit.
Here are the logs that look significant to me:
Code:

2024-05-18T16:18:18.292+0000 71872b0006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:18:18.292+0000 71872b0006c0  0 log_channel(cluster) log [DBG] : map e34813 wrongly marked me down at e34813
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 start_waiting_for_healthy
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 not healthy; waiting to boot
2024-05-18T16:18:18.576+0000 7187378006c0  1 osd.4 34813 start_boot
2024-05-18T16:18:18.577+0000 7187350006c0  1 osd.4 34813 set_numa_affinity storage numa node 0
2024-05-18T16:18:18.577+0000 7187350006c0 -1 osd.4 34813 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:18:18.577+0000 7187350006c0  1 osd.4 34813 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:18:18.577+0000 7187350006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:18:18.578+0000 7187350006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:18:19.318+0000 71872b0006c0  1 osd.4 34814 state: booting -> active

and for the record I also include the output of the NUMA status:
Code:
ceph osd numa-status
OSD  HOST   NETWORK  STORAGE  AFFINITY  CPUS
  0  pve10        -        0         -  -   
  1  pve10        -        0         -  -   
  2  pve11        -        0         -  -   
  3  pve11        -        0         -  -   
  4  pve14        -        0         -  -   
  6  pve12        -        0         -  -   
  7  pve12        -        0         -  -

because it looks so different from https://forum.proxmox.com/threads/fragen-zu-ceph-nach-upgrade-5-4-auf-6-0.56209/#post-258982
 
and more:
Code:
lis/c=34622/34620 les/c/f=34623/34621/0 sis=34629) [6,0]/[6,1] r=-1 lpr=34831 pi=[34314,34629)/1 crt=34623'14369446 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-05-18T16:20:43.607+0000 71872b0006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:20:43.607+0000 71872b0006c0  0 log_channel(cluster) log [DBG] : map e34832 wrongly marked me down at e34832
2024-05-18T16:20:43.607+0000 71872b0006c0 -1 osd.4 34832 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2024-05-18T16:20:43.607+0000 71872b0006c0  1 osd.4 34832 start_waiting_for_healthy
2024-05-18T16:20:43.607+0000 71872b0006c0  0 osd.4 34832 _committed_osd_maps shutdown OSD via async signal
2024-05-18T16:20:43.607+0000 7187404006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2024-05-18T16:20:43.607+0000 7187404006c0 -1 osd.4 34832 *** Got signal Interrupt ***
2024-05-18T16:20:43.607+0000 7187404006c0  0 osd.4 34832 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1
2024-05-18T16:20:43.607+0000 7187404006c0 -1 osd.4 34832 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-05-18T16:20:43.607+0000 7187404006c0  0 osd.4 34832 prepare_to_stop starting shutdown
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  allocation stats probe 0: cnt: 140 frags: 140 size: 860160
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -1: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -2: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -4: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -8: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -16: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4) ------------
2024-05-18T16:20:43.608+0000 7187404006c0  4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-05-18T16:20:43.609+0000 7187404006c0  4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-05-18T16:20:43.637+0000 7187404006c0  1 bluefs umount
2024-05-18T16:20:43.637+0000 7187404006c0  1 bdev(0x56f14606b180 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:20:43.908+0000 7187404006c0  1 freelist shutdown
2024-05-18T16:20:43.908+0000 7187404006c0  1 bdev(0x56f14606ae00 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration total     :0.520355 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration osd_drain :0.000341 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration umount    :0.519904 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration timer     :0.000051 seconds
 
and all the logs between the OSD start and stop:
Code:
osd.4 pg_epoch: 34842 pg[2.1d7( v 34623'14234948 (34483'14233294,34623'14234948] lb MIN local-lis/les=34622/34623 n=0 ec=48/48 lis/c=34622/34620 les/c/f=34623/34621/0 sis=34838) [7,1] r=-1 lpr=34842 pi=[33949,34838)/1 crt=34623'14234948 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [7,1] -> [7,1], acting [7,1] -> [7,1], acting_primary 7 -> 7, up_primary 7 -> 7, role -1 -> -1, features acting 4540138320759226367 upacting 4540138320759226367
2024-05-18T16:25:36.602+0000 72d8a22006c0  1 osd.4 pg_epoch: 34842 pg[2.1d7( v 34623'14234948 (34483'14233294,34623'14234948] lb MIN local-lis/les=34622/34623 n=0 ec=48/48 lis/c=34622/34620 les/c/f=34623/34621/0 sis=34838) [7,1] r=-1 lpr=34842 pi=[33949,34838)/1 crt=34623'14234948 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-05-18T16:26:04.034+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:04.034+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34843 wrongly marked me down at e34843
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 start_waiting_for_healthy
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 not healthy; waiting to boot
2024-05-18T16:26:04.740+0000 72d8be4006c0  1 osd.4 34844 start_boot
2024-05-18T16:26:04.741+0000 72d8bbc006c0  1 osd.4 34844 set_numa_affinity storage numa node 0
2024-05-18T16:26:04.741+0000 72d8bbc006c0 -1 osd.4 34844 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:04.741+0000 72d8bbc006c0  1 osd.4 34844 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:04.742+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:04.742+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:05.609+0000 72d8b1c006c0  1 osd.4 34845 state: booting -> active
2024-05-18T16:26:33.837+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:33.837+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34846 wrongly marked me down at e34846
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 start_waiting_for_healthy
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 not healthy; waiting to boot
2024-05-18T16:26:33.903+0000 72d8be4006c0  1 osd.4 34846 start_boot
2024-05-18T16:26:33.904+0000 72d8bbc006c0  1 osd.4 34846 set_numa_affinity storage numa node 0
2024-05-18T16:26:33.904+0000 72d8bbc006c0 -1 osd.4 34846 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:33.904+0000 72d8bbc006c0  1 osd.4 34846 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:33.905+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:33.905+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:34.576+0000 72d8b1c006c0  1 osd.4 34847 state: booting -> active
2024-05-18T16:26:58.839+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:58.839+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34848 wrongly marked me down at e34848
2024-05-18T16:26:58.839+0000 72d8b1c006c0  1 osd.4 34848 start_waiting_for_healthy
2024-05-18T16:26:58.840+0000 72d8b1c006c0  1 osd.4 34848 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:58.840+0000 72d8b1c006c0  1 osd.4 34848 not healthy; waiting to boot
2024-05-18T16:26:58.901+0000 72d8be4006c0  1 osd.4 34848 start_boot
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 osd.4 34848 set_numa_affinity storage numa node 0
2024-05-18T16:26:58.902+0000 72d8bbc006c0 -1 osd.4 34848 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 osd.4 34848 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:58.903+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:59.398+0000 72d8b1c006c0  1 osd.4 34849 state: booting -> active
2024-05-18T16:27:26.342+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:27:26.342+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34850 wrongly marked me down at e34850
2024-05-18T16:27:26.342+0000 72d8b1c006c0  1 osd.4 34850 start_waiting_for_healthy
2024-05-18T16:27:26.343+0000 72d8b1c006c0  1 osd.4 34850 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:27:26.343+0000 72d8b1c006c0  1 osd.4 34850 not healthy; waiting to boot
2024-05-18T16:27:26.703+0000 72d8be4006c0  1 osd.4 34850 start_boot
2024-05-18T16:27:26.703+0000 72d8bbc006c0  1 osd.4 34850 set_numa_affinity storage numa node 0
2024-05-18T16:27:26.703+0000 72d8bbc006c0 -1 osd.4 34850 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 osd.4 34850 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:27.358+0000 72d8b1c006c0  1 osd.4 34851 state: booting -> active
2024-05-18T16:27:51.444+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:27:51.444+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34852 wrongly marked me down at e34852
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 start_waiting_for_healthy
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 not healthy; waiting to boot
2024-05-18T16:27:51.580+0000 72d8be4006c0  1 osd.4 34852 start_boot
2024-05-18T16:27:51.581+0000 72d8bbc006c0  1 osd.4 34852 set_numa_affinity storage numa node 0
2024-05-18T16:27:51.581+0000 72d8bbc006c0 -1 osd.4 34852 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:27:51.581+0000 72d8bbc006c0  1 osd.4 34852 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:27:51.582+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:51.582+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:52.245+0000 72d8b1c006c0  1 osd.4 34853 state: booting -> active
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34854 wrongly marked me down at e34854
2024-05-18T16:28:16.348+0000 72d8b1c006c0 -1 osd.4 34854 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2024-05-18T16:28:16.348+0000 72d8b1c006c0  1 osd.4 34854 start_waiting_for_healthy
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 osd.4 34854 _committed_osd_maps shutdown OSD via async signal
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 osd.4 34854 *** Got signal Interrupt ***
2024-05-18T16:28:16.349+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 osd.4 34854 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-05-18T16:28:16.349+0000 72d8c70006c0  0 osd.4 34854 prepare_to_stop starting shutdown
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  allocation stats probe 0: cnt: 44 frags: 44 size: 270336
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -1: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -2: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -4: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -8: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -16: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4) ------------
2024-05-18T16:28:16.349+0000 72d8c70006c0  4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-05-18T16:28:16.350+0000 72d8c70006c0  4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-05-18T16:28:16.361+0000 72d8c70006c0  1 bluefs umount
2024-05-18T16:28:16.361+0000 72d8c70006c0  1 bdev(0x5fe4a17e5180 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:28:16.623+0000 72d8c70006c0  1 freelist shutdown
2024-05-18T16:28:16.623+0000 72d8c70006c0  1 bdev(0x5fe4a17e4e00 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration total     :0.514841 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration osd_drain :0.000334 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration umount    :0.514394 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration timer     :0.000055 seconds
 
I don't think it's related to NUMA. I have some virtual Ceph clusters where NUMA is not present; I get exactly the same warning message, ceph osd numa-status is also empty, and everything works fine.


Maybe you could try to increase the debug level in ceph.conf, e.g. debug_osd = 20, to get more logs.
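For example (either persistently in ceph.conf before restarting the OSD, or at runtime; the OSD id is just the one from your logs, and I would also bump debug_ms for the network side):
Code:
# persistent, in /etc/pve/ceph.conf:
[osd]
        debug_osd = 20
        debug_ms = 1

# or at runtime, without restarting the daemon:
ceph config set osd.4 debug_osd 20
ceph config set osd.4 debug_ms 1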

Are you sure that you don't have any firewall between the OSDs and the monitors?
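A quick way to check from the new node (replace the IPs with your monitor and other OSD hosts, these are just placeholders):
Code:
# monitors must be reachable on 3300 and 6789 (msgr v2 / v1)
nc -zv 10.0.0.10 3300
nc -zv 10.0.0.10 6789
# other OSD hosts must be reachable on the OSD/heartbeat range 6800-7300
nc -zv 10.0.0.11 6800
# state of the Proxmox firewall on this node
pve-firewall status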
 
thanks, I'll check this ASAP
 
Hmm...
The firewall on the new node seems to have a bad setup.
I'm monitoring the new setup and will add more logs anyway to be sure.
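For completeness, the kind of rule I'm checking on the new node looks roughly like this (example only; adapt the source subnet to your Ceph public/cluster network):
Code:
# /etc/pve/nodes/node4/host.fw (illustrative example)
[RULES]
# monitors (msgr v2 / v1)
IN ACCEPT -source 10.0.0.0/24 -p tcp -dport 3300,6789
# OSD and heartbeat ports
IN ACCEPT -source 10.0.0.0/24 -p tcp -dport 6800:7300

# then verify the OSD stays up
ceph -s
ceph osd tree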