[SOLVED] Ceph Reef OSD keeps shutting down

flotho

Renowned Member
Sep 3, 2012
Hi everyone,

I'm working with a 3-node cluster running Ceph 17 and I'm about to upgrade.
I have also added a new node to the cluster and installed Ceph 18.2 on it.
The first OSD I created seems OK, yet after a few moments it is shut down. Here is what I can find in the logs:
Code:
May 18 15:34:44 node4 ceph-osd[18535]: 2024-05-18T15:34:44.690+0000 7344afa006c0 -1 osd.4 34571 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:35:10 node4 ceph-osd[18535]: 2024-05-18T15:35:10.513+0000 7344afa006c0 -1 osd.4 34599 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:35:36 node4 ceph-osd[18535]: 2024-05-18T15:35:36.725+0000 7344afa006c0 -1 osd.4 34606 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:36:04 node4 ceph-osd[18535]: 2024-05-18T15:36:04.815+0000 7344afa006c0 -1 osd.4 34612 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:36:35 node4 ceph-osd[18535]: 2024-05-18T15:36:35.811+0000 7344afa006c0 -1 osd.4 34617 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:37:07 node4 ceph-osd[18535]: 2024-05-18T15:37:07.839+0000 7344afa006c0 -1 osd.4 34621 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.311+0000 7344a5a006c0 -1 osd.4 34625 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 osd.4 34625 *** Got signal Interrupt ***
May 18 15:37:35 node4 ceph-osd[18535]: 2024-05-18T15:37:35.312+0000 7344bae006c0 -1 osd.4 34625 *** Immediate shutdown (osd_fast_shutdown=true) ***
May 18 15:37:36 node4 systemd[1]: ceph-osd@4.service: Deactivated successfully.
May 18 15:37:36 node4 systemd[1]: ceph-osd@4.service: Consumed 35.013s CPU time.

Browsing the forum suggests that the message should be harmless (https://forum.proxmox.com/threads/c...-identify-public-interface.58239/#post-268689 and https://github.com/rook/rook/issues/4374), yet as we can see, once the message has been raised 6 times the OSD shuts down.
I've also tried to downgrade to a Ceph 18.1 installation as proposed here: https://forum.proxmox.com/threads/a...2-2-each-osds-never-start.144621/#post-651398
The message is still raised and the OSD is shut down.
So I'm wondering how to avoid this message?
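For reference, this is roughly how I checked the markdown limit and the network settings on the new node (standard config commands, nothing exotic):
Code:
# limit that triggers the shutdown (defaults: 5 markdowns within 600 s)
ceph config get osd osd_max_markdown_count
ceph config get osd osd_max_markdown_period

# check that public_network/cluster_network match an interface on the new node
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
ip -br addr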

Any help / advice would be appreciated.

Regards
 
It's not related.

set_numa_affinity is done once at OSD service start.

It seems that your OSD is restarting in a loop; after 5 restarts it goes into protection mode to avoid an infinite loop and further impact on the cluster.

Do you have logs in /var/log/ceph/ceph-osd.*.log ?
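For example, something like this (OSD id taken from your logs) should show what happens right before each shutdown:
Code:
ls -l /var/log/ceph/
grep -E 'wrongly marked|marked down|Got signal' /var/log/ceph/ceph-osd.4.log | tail -n 50
tail -n 200 /var/log/ceph/ceph-osd.4.log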
 
Thanks for your support @spirit.
Here are the logs that look significant to me:
Code:

2024-05-18T16:18:18.292+0000 71872b0006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:18:18.292+0000 71872b0006c0  0 log_channel(cluster) log [DBG] : map e34813 wrongly marked me down at e34813
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 start_waiting_for_healthy
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:18:18.292+0000 71872b0006c0  1 osd.4 34813 not healthy; waiting to boot
2024-05-18T16:18:18.576+0000 7187378006c0  1 osd.4 34813 start_boot
2024-05-18T16:18:18.577+0000 7187350006c0  1 osd.4 34813 set_numa_affinity storage numa node 0
2024-05-18T16:18:18.577+0000 7187350006c0 -1 osd.4 34813 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:18:18.577+0000 7187350006c0  1 osd.4 34813 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:18:18.577+0000 7187350006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:18:18.578+0000 7187350006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:18:19.318+0000 71872b0006c0  1 osd.4 34814 state: booting -> active

and for the record I also include the output of the NUMA status:
Code:
ceph osd numa-status
OSD  HOST   NETWORK  STORAGE  AFFINITY  CPUS
  0  pve10        -        0         -  -   
  1  pve10        -        0         -  -   
  2  pve11        -        0         -  -   
  3  pve11        -        0         -  -   
  4  pve14        -        0         -  -   
  6  pve12        -        0         -  -   
  7  pve12        -        0         -  -

because it looks so different from https://forum.proxmox.com/threads/fragen-zu-ceph-nach-upgrade-5-4-auf-6-0.56209/#post-258982
 
and more:
Code:
lis/c=34622/34620 les/c/f=34623/34621/0 sis=34629) [6,0]/[6,1] r=-1 lpr=34831 pi=[34314,34629)/1 crt=34623'14369446 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-05-18T16:20:43.607+0000 71872b0006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:20:43.607+0000 71872b0006c0  0 log_channel(cluster) log [DBG] : map e34832 wrongly marked me down at e34832
2024-05-18T16:20:43.607+0000 71872b0006c0 -1 osd.4 34832 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2024-05-18T16:20:43.607+0000 71872b0006c0  1 osd.4 34832 start_waiting_for_healthy
2024-05-18T16:20:43.607+0000 71872b0006c0  0 osd.4 34832 _committed_osd_maps shutdown OSD via async signal
2024-05-18T16:20:43.607+0000 7187404006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2024-05-18T16:20:43.607+0000 7187404006c0 -1 osd.4 34832 *** Got signal Interrupt ***
2024-05-18T16:20:43.607+0000 7187404006c0  0 osd.4 34832 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1
2024-05-18T16:20:43.607+0000 7187404006c0 -1 osd.4 34832 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-05-18T16:20:43.607+0000 7187404006c0  0 osd.4 34832 prepare_to_stop starting shutdown
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  allocation stats probe 0: cnt: 140 frags: 140 size: 860160
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -1: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -2: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -4: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -8: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -16: 0,  0, 0
2024-05-18T16:20:43.608+0000 718730a006c0  0 bluestore(/var/lib/ceph/osd/ceph-4) ------------
2024-05-18T16:20:43.608+0000 7187404006c0  4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-05-18T16:20:43.609+0000 7187404006c0  4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-05-18T16:20:43.637+0000 7187404006c0  1 bluefs umount
2024-05-18T16:20:43.637+0000 7187404006c0  1 bdev(0x56f14606b180 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:20:43.908+0000 7187404006c0  1 freelist shutdown
2024-05-18T16:20:43.908+0000 7187404006c0  1 bdev(0x56f14606ae00 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration total     :0.520355 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration osd_drain :0.000341 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration umount    :0.519904 seconds
2024-05-18T16:20:44.128+0000 7187404006c0  0 osd.4 34832 Fast Shutdown duration timer     :0.000051 seconds
 
and all the logs between the OSD start and stop:
Code:
osd.4 pg_epoch: 34842 pg[2.1d7( v 34623'14234948 (34483'14233294,34623'14234948] lb MIN local-lis/les=34622/34623 n=0 ec=48/48 lis/c=34622/34620 les/c/f=34623/34621/0 sis=34838) [7,1] r=-1 lpr=34842 pi=[33949,34838)/1 crt=34623'14234948 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [7,1] -> [7,1], acting [7,1] -> [7,1], acting_primary 7 -> 7, up_primary 7 -> 7, role -1 -> -1, features acting 4540138320759226367 upacting 4540138320759226367
2024-05-18T16:25:36.602+0000 72d8a22006c0  1 osd.4 pg_epoch: 34842 pg[2.1d7( v 34623'14234948 (34483'14233294,34623'14234948] lb MIN local-lis/les=34622/34623 n=0 ec=48/48 lis/c=34622/34620 les/c/f=34623/34621/0 sis=34838) [7,1] r=-1 lpr=34842 pi=[33949,34838)/1 crt=34623'14234948 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-05-18T16:26:04.034+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:04.034+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34843 wrongly marked me down at e34843
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 start_waiting_for_healthy
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:04.034+0000 72d8b1c006c0  1 osd.4 34843 not healthy; waiting to boot
2024-05-18T16:26:04.740+0000 72d8be4006c0  1 osd.4 34844 start_boot
2024-05-18T16:26:04.741+0000 72d8bbc006c0  1 osd.4 34844 set_numa_affinity storage numa node 0
2024-05-18T16:26:04.741+0000 72d8bbc006c0 -1 osd.4 34844 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:04.741+0000 72d8bbc006c0  1 osd.4 34844 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:04.742+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:04.742+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:05.609+0000 72d8b1c006c0  1 osd.4 34845 state: booting -> active
2024-05-18T16:26:33.837+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:33.837+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34846 wrongly marked me down at e34846
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 start_waiting_for_healthy
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:33.837+0000 72d8b1c006c0  1 osd.4 34846 not healthy; waiting to boot
2024-05-18T16:26:33.903+0000 72d8be4006c0  1 osd.4 34846 start_boot
2024-05-18T16:26:33.904+0000 72d8bbc006c0  1 osd.4 34846 set_numa_affinity storage numa node 0
2024-05-18T16:26:33.904+0000 72d8bbc006c0 -1 osd.4 34846 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:33.904+0000 72d8bbc006c0  1 osd.4 34846 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:33.905+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:33.905+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:34.576+0000 72d8b1c006c0  1 osd.4 34847 state: booting -> active
2024-05-18T16:26:58.839+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:26:58.839+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34848 wrongly marked me down at e34848
2024-05-18T16:26:58.839+0000 72d8b1c006c0  1 osd.4 34848 start_waiting_for_healthy
2024-05-18T16:26:58.840+0000 72d8b1c006c0  1 osd.4 34848 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:26:58.840+0000 72d8b1c006c0  1 osd.4 34848 not healthy; waiting to boot
2024-05-18T16:26:58.901+0000 72d8be4006c0  1 osd.4 34848 start_boot
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 osd.4 34848 set_numa_affinity storage numa node 0
2024-05-18T16:26:58.902+0000 72d8bbc006c0 -1 osd.4 34848 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 osd.4 34848 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:26:58.902+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:58.903+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:26:59.398+0000 72d8b1c006c0  1 osd.4 34849 state: booting -> active
2024-05-18T16:27:26.342+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:27:26.342+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34850 wrongly marked me down at e34850
2024-05-18T16:27:26.342+0000 72d8b1c006c0  1 osd.4 34850 start_waiting_for_healthy
2024-05-18T16:27:26.343+0000 72d8b1c006c0  1 osd.4 34850 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:27:26.343+0000 72d8b1c006c0  1 osd.4 34850 not healthy; waiting to boot
2024-05-18T16:27:26.703+0000 72d8be4006c0  1 osd.4 34850 start_boot
2024-05-18T16:27:26.703+0000 72d8bbc006c0  1 osd.4 34850 set_numa_affinity storage numa node 0
2024-05-18T16:27:26.703+0000 72d8bbc006c0 -1 osd.4 34850 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 osd.4 34850 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:26.704+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:27.358+0000 72d8b1c006c0  1 osd.4 34851 state: booting -> active
2024-05-18T16:27:51.444+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:27:51.444+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34852 wrongly marked me down at e34852
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 start_waiting_for_healthy
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 is_healthy false -- only 0/6 up peers (less than 33%)
2024-05-18T16:27:51.444+0000 72d8b1c006c0  1 osd.4 34852 not healthy; waiting to boot
2024-05-18T16:27:51.580+0000 72d8be4006c0  1 osd.4 34852 start_boot
2024-05-18T16:27:51.581+0000 72d8bbc006c0  1 osd.4 34852 set_numa_affinity storage numa node 0
2024-05-18T16:27:51.581+0000 72d8bbc006c0 -1 osd.4 34852 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-05-18T16:27:51.581+0000 72d8bbc006c0  1 osd.4 34852 set_numa_affinity setting numa affinity to node 0 cpus 0-5,24-29
2024-05-18T16:27:51.582+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:51.582+0000 72d8bbc006c0  1 bluestore(/var/lib/ceph/osd/ceph-4) collect_metadata devices span numa nodes 0
2024-05-18T16:27:52.245+0000 72d8b1c006c0  1 osd.4 34853 state: booting -> active
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 log_channel(cluster) log [DBG] : map e34854 wrongly marked me down at e34854
2024-05-18T16:28:16.348+0000 72d8b1c006c0 -1 osd.4 34854 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2024-05-18T16:28:16.348+0000 72d8b1c006c0  1 osd.4 34854 start_waiting_for_healthy
2024-05-18T16:28:16.348+0000 72d8b1c006c0  0 osd.4 34854 _committed_osd_maps shutdown OSD via async signal
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 osd.4 34854 *** Got signal Interrupt ***
2024-05-18T16:28:16.349+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1
2024-05-18T16:28:16.349+0000 72d8c70006c0 -1 osd.4 34854 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-05-18T16:28:16.349+0000 72d8c70006c0  0 osd.4 34854 prepare_to_stop starting shutdown
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  allocation stats probe 0: cnt: 44 frags: 44 size: 270336
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -1: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -2: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -4: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -8: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4)  probe -16: 0,  0, 0
2024-05-18T16:28:16.349+0000 72d8b76006c0  0 bluestore(/var/lib/ceph/osd/ceph-4) ------------
2024-05-18T16:28:16.349+0000 72d8c70006c0  4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-05-18T16:28:16.350+0000 72d8c70006c0  4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-05-18T16:28:16.361+0000 72d8c70006c0  1 bluefs umount
2024-05-18T16:28:16.361+0000 72d8c70006c0  1 bdev(0x5fe4a17e5180 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:28:16.623+0000 72d8c70006c0  1 freelist shutdown
2024-05-18T16:28:16.623+0000 72d8c70006c0  1 bdev(0x5fe4a17e4e00 /var/lib/ceph/osd/ceph-4/block) close
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration total     :0.514841 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration osd_drain :0.000334 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration umount    :0.514394 seconds
2024-05-18T16:28:16.863+0000 72d8c70006c0  0 osd.4 34854 Fast Shutdown duration timer     :0.000055 seconds
 
I don't think it's related to NUMA. I have some virtual Ceph clusters where NUMA is not present; I get exactly the same warning message, ceph osd numa-status is also empty, and everything works fine.


Maybe you could try to increase the debug level in ceph.conf, e.g. debug_osd = 20, to get more logs.
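For example (either persistently in ceph.conf before restarting the OSD, or at runtime; the OSD id is just the one from your logs, and I would also bump debug_ms for the network side):
Code:
# persistent, in /etc/pve/ceph.conf:
[osd]
        debug_osd = 20
        debug_ms = 1

# or at runtime, without restarting the daemon:
ceph config set osd.4 debug_osd 20
ceph config set osd.4 debug_ms 1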

Are you sure that you don't have any firewall between the OSDs and the monitors?
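A quick way to check from the new node (replace the IPs with your monitor and other OSD hosts, these are just placeholders):
Code:
# monitors must be reachable on 3300 and 6789 (msgr v2 / v1)
nc -zv 10.0.0.10 3300
nc -zv 10.0.0.10 6789
# other OSD hosts must be reachable on the OSD/heartbeat range 6800-7300
nc -zv 10.0.0.11 6800
# state of the Proxmox firewall on this node
pve-firewall status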
 
thanks, I'll check this ASAP
 
Hmm...
The firewall on the new node seems to have a bad setup.
I'm monitoring the new setup and will add more logs anyway to be sure.
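For completeness, the kind of rule I'm checking on the new node looks roughly like this (example only; adapt the source subnet to your Ceph public/cluster network):
Code:
# /etc/pve/nodes/node4/host.fw (illustrative example)
[RULES]
# monitors (msgr v2 / v1)
IN ACCEPT -source 10.0.0.0/24 -p tcp -dport 3300,6789
# OSD and heartbeat ports
IN ACCEPT -source 10.0.0.0/24 -p tcp -dport 6800:7300

# then verify the OSD stays up
ceph -s
ceph osd tree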