[SOLVED] After upgrade to 9, Ceph fails on one node

RobFantini

I noticed OSDs down on the PVE web page. I tried to start them, but they fail.
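To see why an OSD refuses to start, the systemd unit status and journal on the affected node are the first place to look; a minimal check, using OSD 6 as an example ID:
Code:
# systemctl status ceph-osd@6.service
# journalctl -b -u ceph-osd@6 --no-pager | tail -n 50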

On a node where Ceph is up:
Code:
# ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
            noout flag(s) set
            10 osds down
            1 host (10 osds) down
            Degraded data redundancy: 1165923/4919307 objects degraded (23.701%), 92 pgs degraded, 92 pgs undersized
 
  services:
    mon: 3 daemons, quorum pve11,pve4,pve2 (age 25m)
    mgr: pve2(active, since 27m), standbys: pve11, pve4
    osd: 41 osds: 31 up (since 21m), 41 in (since 2w)
         flags noout
 
  data:
    pools:   2 pools, 129 pgs
    objects: 1.64M objects, 6.1 TiB
    usage:   18 TiB used, 131 TiB / 149 TiB avail
    pgs:     1165923/4919307 objects degraded (23.701%)
             92 active+undersized+degraded
             37 active+clean
 
  io:
    client:   3.5 MiB/s rd, 2.5 MiB/s wr, 228 op/s rd, 262 op/s wr
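The noout flag shown above keeps Ceph from marking the down OSDs out, which prevents a large rebalance while the host is being repaired; it is set and cleared like this:
Code:
# ceph osd set noout     # before maintenance / while repairing the host
# ceph osd unset noout   # once all OSDs are back up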

On the node with Ceph down [I had to press Ctrl+C after a minute; it was stuck]:
Code:
# ceph -s

^CCluster connection aborted
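A ceph -s that hangs like this usually means the client cannot reach any monitor at all. Instead of waiting, a timeout can be forced and monitor reachability checked directly (the IP is a placeholder; the real ones are in /etc/pve/ceph.conf):
Code:
# ceph -s --connect-timeout 10
# ping -c 3 <monitor IP from /etc/pve/ceph.conf>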

I tried this:
Code:
# pveceph install
This will install Ceph 19.2 Squid - continue (y/N)? y
update available package list
start installation
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ceph is already the newest version (19.2.3-pve1).
ceph-common is already the newest version (19.2.3-pve1).
ceph-fuse is already the newest version (19.2.3-pve1).
ceph-mds is already the newest version (19.2.3-pve1).
ceph-volume is already the newest version (19.2.3-pve1).
gdisk is already the newest version (1.0.10-2).
nvme-cli is already the newest version (2.13-2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

installed Ceph 19.2 Squid successfully!
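Since every package is already the newest version, reinstalling is not the fix. Running this from a healthy node confirms which version each daemon in the cluster actually reports:
Code:
# ceph versions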

Any advice on how to debug and fix this?
 
The OSDs are mounted on the bad node:
Code:
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-6
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-2
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-23
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-20
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-15
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-14
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-18
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-22
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-9
tmpfs                  tmpfs     378G   24K  378G   1% /var/lib/ceph/osd/ceph-21
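These tmpfs mounts are normal for BlueStore: ceph-volume recreates /var/lib/ceph/osd/ceph-* on tmpfs at activation and populates it from LVM metadata, so their presence just means activation ran. To inspect or re-activate the local OSDs:
Code:
# ceph-volume lvm list
# ceph-volume lvm activate --all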
 
I found the cause: the bad node cannot reach the cluster network. This is the only node where I did not pin the network interface names with systemd aliases, so the NIC names presumably changed with the upgrade.
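To verify, compare the kernel's current interface names against what /etc/network/interfaces expects, and try to reach another node's cluster-network address (the target IP below is a placeholder):
Code:
# ip -br link
# grep -B1 -A3 'iface' /etc/network/interfaces
# ping -c 3 <cluster-network IP of another node>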
 
To find the device name:

1 - plug in the cable

2 - check dmesg:
Code:
[Fri Aug  8 09:19:59 2025] mlx5_core 0000:a8:00.1: Port module event: module 1, Cable plugged

3 - find the device name behind that PCI address:
Code:
ls /sys/bus/pci/devices/0000\:a8\:00.1/net/

Also, this is a great tool; I ran it first (see the manual):
Code:
pve-network-interface-pinning generate
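For reference, the pinning tool appears to work by writing systemd .link files that match a NIC by MAC address and assign it a stable name. A minimal hand-written sketch of such a file, with made-up path, MAC, and name:
Code:
# /etc/systemd/network/50-nic0.link  -- hypothetical path, MAC, and name
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=nic0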