[SOLVED] Proxmox VE 9.0.3 Cluster + Ceph 19.2.3: Node OSDs stop working after reboot

Aug 4, 2025
Hi all,

I'm new to Proxmox. For many years I've worked with VMware, but I'm starting to enjoy Proxmox very much and I'm now migrating all the solutions I manage to it.
One of those solutions is a VMware cluster that I intend to replace with a Proxmox Cluster.
So, I got 3 servers, with the same resources, to build my first Proxmox Cluster.

Each of the 3 nodes of my cluster has:
  • HPE Proliant DL380 G10
  • 2 x Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
  • 512 GB RAM
  • 3 x SSD Disks 480 GB (RAID1 + Spare) for Proxmox Installation and Execution
  • 3 x SSD Disks 960 GB (no RAID or any kind of disk controller cache) for DB and WAL Disks of Ceph
  • 6 x SSD Disks 1.92 TB (no RAID or any kind of disk controller cache) for the Ceph data disks (I'm using one 960 GB DB/WAL SSD for every two 1.92 TB data SSDs on the OSDs, for improved performance; see the sketch after this list)
  • 1 x 4 Port RJ45 Network Adapter (using 1 port for management)
  • 1 x 4 Port SFP+ Network Adapter (using an aggregation of 4 x 10 Gbps SFP+ ports with a dedicated switch, used for the Cluster vLAN, the Ceph Public Network vLAN and the Ceph Cluster Network vLAN)
  • All firmware updated, at the exact same version on each of the 3 nodes
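For reference, this is roughly how each pair of OSDs maps onto a shared DB/WAL disk with the Proxmox CLI (a sketch only; the device names and DB size below are placeholders):

Bash:
# Two data disks (/dev/sdb, /dev/sdc) sharing one DB/WAL disk (/dev/sdh)
pveceph osd create /dev/sdb -db_dev /dev/sdh -db_dev_size 178.85
pveceph osd create /dev/sdc -db_dev /dev/sdh -db_dev_size 178.85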

So, I installed Proxmox 8.4 on each node, configured the network, set up a community subscription, updated the servers and grouped them into a cluster. Before I installed Ceph, Proxmox VE 9 was officially released, so I decided to upgrade all 3 nodes using the guide: https://pve.proxmox.com/wiki/Upgrade_from_8_to_9

The upgrade was successful. I then installed and configured Ceph, added all the OSDs, created the pools and tested it out.

It worked perfectly and the performance was very good. All cluster vLANs ping fine and the iperf benchmarks are great.
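For anyone who wants to repeat those checks, something along these lines works between two nodes (the peer IP here is simply node #2's Ceph public address):

Bash:
# Jumbo frames must pass end-to-end: 8972 = 9000 MTU minus IP/ICMP headers
ping -M do -s 8972 -c 4 10.10.0.102
# Throughput over the bond (run "iperf3 -s" on the peer node first)
iperf3 -c 10.10.0.102 -P 4 -t 30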

A good start.

I then decided to run some resilience tests. I shut down node #1 on purpose and, after a while, turned it back on to check how Ceph would recover.

The issue I discovered is that the Ceph OSD services don't start after that reboot... All 6 OSDs are down and 5 are out; for some reason one of them is still in. I tried starting them manually and they don't come up. Checking the logs, weird stuff starts to appear...

I redid the entire configuration, because I had probably messed something up, but the result was exactly the same. So either this is due to some bug (I'm using the latest versions) or the configuration is not well done (the most probable cause).

I will share the Ceph & Network configurations:

It's in the annex file (Node Configuration.txt), since I had a 16k char limit in my post... Sorry for that.


Captura de ecrã 2025-08-09, às 08.49.45.png

Captura de ecrã 2025-08-09, às 08.55.19.png

LVM:

Bash:
root@bo-pmx-vmhost-1:~# lvs
  LV                                             VG                                        Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-64f623f3-e7e0-4427-800d-001d68395752 ceph-23dd2e01-ec30-4b2c-a966-eb6470118a8f -wi-a-----   <1.75t                                                  
  osd-block-6382361f-cd3d-480e-95ae-2795365bc057 ceph-45cf05f7-6932-4e5b-bc05-8ee4e635b91c -wi-a-----   <1.75t                                                  
  osd-db-6ffc290a-2578-4b5e-bfd2-a5ba6b87b039    ceph-70923913-d761-4fad-8bba-58c663b7b572 -wi-------  178.85g                                                  
  osd-db-82369648-6322-4330-bf11-2cdfa7fa7430    ceph-70923913-d761-4fad-8bba-58c663b7b572 -wi-------  178.85g                                                  
  osd-block-8d6509db-5924-4d2e-9f26-87e4f1e61956 ceph-7bde6d27-c854-420b-a67e-9e3a8719d67b -wi-a-----   <1.75t                                                  
  osd-block-71e3207a-e995-4d68-abe7-3fe52ea69b07 ceph-8a347808-e50e-4434-9fde-7fd5ae866145 -wi-a-----   <1.75t                                                  
  osd-db-620cd823-8117-4283-b414-af337c32f0d9    ceph-8dac7266-745e-4944-b13e-58c8d1d9d05c -wi-------  178.85g                                                  
  osd-db-dde1b91d-0d00-40e6-a91c-e7c318c1a06b    ceph-8dac7266-745e-4944-b13e-58c8d1d9d05c -wi-------  178.85g                                                  
  osd-block-859b5141-5a06-4ed5-b52f-d1122e0ea660 ceph-c91ac7c3-a4f1-4fd0-8fcc-4b96db47c43a -wi-a-----   <1.75t                                                  
  osd-block-b1e999dd-29a1-4b60-b465-147bd40ed803 ceph-ced07cd7-ad32-4922-991a-6b4ceb909ea6 -wi-a-----   <1.75t                                                  
  osd-db-0511dba8-b75e-4c67-858e-24070573293e    ceph-e09583f5-e3da-4292-9328-1ff324a51396 -wi-------  178.85g                                                  
  osd-db-54f7018f-104d-445e-abf4-634740c21c43    ceph-e09583f5-e3da-4292-9328-1ff324a51396 -wi-------  178.85g                                                  
  data                                           pve                                       twi-a-tz-- <319.58g             0.00   0.52                          
  root                                           pve                                       -wi-ao----   96.00g                                                  
  swap                                           pve                                       -wi-ao----    8.00g                                                  
root@bo-pmx-vmhost-1:~# vgs
  VG                                        #PV #LV #SN Attr   VSize    VFree  
  ceph-23dd2e01-ec30-4b2c-a966-eb6470118a8f   1   1   0 wz--n-   <1.75t       0
  ceph-45cf05f7-6932-4e5b-bc05-8ee4e635b91c   1   1   0 wz--n-   <1.75t       0
  ceph-70923913-d761-4fad-8bba-58c663b7b572   1   2   0 wz--n-  894.25g <536.55g
  ceph-7bde6d27-c854-420b-a67e-9e3a8719d67b   1   1   0 wz--n-   <1.75t       0
  ceph-8a347808-e50e-4434-9fde-7fd5ae866145   1   1   0 wz--n-   <1.75t       0
  ceph-8dac7266-745e-4944-b13e-58c8d1d9d05c   1   2   0 wz--n-  894.25g <536.55g
  ceph-c91ac7c3-a4f1-4fd0-8fcc-4b96db47c43a   1   1   0 wz--n-   <1.75t       0
  ceph-ced07cd7-ad32-4922-991a-6b4ceb909ea6   1   1   0 wz--n-   <1.75t       0
  ceph-e09583f5-e3da-4292-9328-1ff324a51396   1   2   0 wz--n-  894.25g <536.55g
  pve                                         1   3   0 wz--n- <446.10g   16.00g


So, after node #1 reboot, we have:

Bash:
root@bo-pmx-vmhost-1:~# systemctl status ceph-osd@0
× ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Fri 2025-08-08 20:06:53 WEST; 11h ago
   Duration: 909ms
 Invocation: ac90d07730e54aa2bade30b829639d6b
    Process: 5686 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
    Process: 5708 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 5708 (code=exited, status=1/FAILURE)

Aug 08 20:06:53 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 3.
Aug 08 20:06:53 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Aug 08 20:06:53 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Failed with result 'exit-code'.
Aug 08 20:06:53 bo-pmx-vmhost-1 systemd[1]: Failed to start ceph-osd@0.service - Ceph object storage daemon osd.0.
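Since the unit hit its start limit ("Start request repeated too quickly"), a manual retry may first need the failure counter cleared; this only allows another attempt, it doesn't fix anything:

Bash:
systemctl reset-failed ceph-osd@0.service
systemctl start ceph-osd@0.service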

I try to start the service manually; it appears to come up at first, but it doesn't actually work:

Bash:
root@bo-pmx-vmhost-1:~# systemctl start ceph-osd@0
root@bo-pmx-vmhost-1:~# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Sat 2025-08-09 07:44:17 WEST; 3s ago
 Invocation: 63de195505564303a5b084571507ddd6
    Process: 264147 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
   Main PID: 264151 (ceph-osd)
      Tasks: 19
     Memory: 12.7M (peak: 13.3M)
        CPU: 95ms
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
             └─264151 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Aug 09 07:44:17 bo-pmx-vmhost-1 systemd[1]: Starting ceph-osd@0.service - Ceph object storage daemon osd.0...
Aug 09 07:44:17 bo-pmx-vmhost-1 systemd[1]: Started ceph-osd@0.service - Ceph object storage daemon osd.0.
Aug 09 07:44:18 bo-pmx-vmhost-1 ceph-osd[264151]: 2025-08-09T07:44:18.071+0100 76047b440680 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minima>

Checking the ceph-osd@0.service logs:

Bash:
root@bo-pmx-vmhost-1:~# journalctl -u ceph-osd@0 -f
Aug 09 07:44:46 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 2.
Aug 09 07:44:46 bo-pmx-vmhost-1 systemd[1]: Starting ceph-osd@0.service - Ceph object storage daemon osd.0...
Aug 09 07:44:46 bo-pmx-vmhost-1 systemd[1]: Started ceph-osd@0.service - Ceph object storage daemon osd.0.
Aug 09 07:44:46 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:46.647+0100 781b2b076680 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minimal_open_bluefs /var/lib/ceph/osd/ceph-0/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 07:44:49 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:49.705+0100 781b2b076680 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minimal_open_bluefs /var/lib/ceph/osd/ceph-0/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 07:44:49 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:49.705+0100 781b2b076680 -1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db failed to prepare db environment:
Aug 09 07:44:49 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:49.975+0100 781b2b076680 -1 osd.0 0 OSD:init: unable to mount object store
Aug 09 07:44:49 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:49.975+0100 781b2b076680 -1  ** ERROR: osd init failed: (5) Input/output error
Aug 09 07:44:49 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 07:44:49 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Failed with result 'exit-code'.
Aug 09 07:45:00 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 3.
Aug 09 07:45:00 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Aug 09 07:45:00 bo-pmx-vmhost-1 systemd[1]: ceph-osd@0.service: Failed with result 'exit-code'.
Aug 09 07:45:00 bo-pmx-vmhost-1 systemd[1]: Failed to start ceph-osd@0.service - Ceph object storage daemon osd.0.

The issue seems to be some kind of broken link:

Aug 09 07:44:46 bo-pmx-vmhost-1 ceph-osd[264549]: 2025-08-09T07:44:46.647+0100 781b2b076680 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minimal_open_bluefs /var/lib/ceph/osd/ceph-0/block.db symlink exists but target unusable: (13) Permission denied

Investigating the broken link:

Bash:
root@bo-pmx-vmhost-1:~# realpath /var/lib/ceph/osd/ceph-0/block.db
realpath: /var/lib/ceph/osd/ceph-0/block.db: No such file or directory
root@bo-pmx-vmhost-1:~# ls -l /var/lib/ceph/osd/ceph-*/block.db
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-0/block.db -> /dev/ceph-70923913-d761-4fad-8bba-58c663b7b572/osd-db-82369648-6322-4330-bf11-2cdfa7fa7430
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-1/block.db -> /dev/ceph-70923913-d761-4fad-8bba-58c663b7b572/osd-db-6ffc290a-2578-4b5e-bfd2-a5ba6b87b039
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-2/block.db -> /dev/ceph-e09583f5-e3da-4292-9328-1ff324a51396/osd-db-54f7018f-104d-445e-abf4-634740c21c43
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-3/block.db -> /dev/ceph-e09583f5-e3da-4292-9328-1ff324a51396/osd-db-0511dba8-b75e-4c67-858e-24070573293e
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-4/block.db -> /dev/ceph-8dac7266-745e-4944-b13e-58c8d1d9d05c/osd-db-dde1b91d-0d00-40e6-a91c-e7c318c1a06b
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-5/block.db -> /dev/ceph-8dac7266-745e-4944-b13e-58c8d1d9d05c/osd-db-620cd823-8117-4283-b414-af337c32f0d9
root@bo-pmx-vmhost-1:~# ls -l /var/lib/ceph/osd/ceph-0/block.db
lrwxrwxrwx 1 root root 90 Aug  8 20:09 /var/lib/ceph/osd/ceph-0/block.db -> /dev/ceph-70923913-d761-4fad-8bba-58c663b7b572/osd-db-82369648-6322-4330-bf11-2cdfa7fa7430
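Cross-checking with the lvs output above: the osd-db LVs are the ones without the "a" (active) flag in the Attr column, so their /dev/<vg>/<lv> device nodes simply don't exist after the reboot:

Bash:
# The symlink target only exists while the LV is active
ls -l /dev/ceph-70923913-d761-4fad-8bba-58c663b7b572/
# Active LVs carry an "a" in the 5th Attr character; the osd-db LVs show "-wi-------"
lvs -o vg_name,lv_name,lv_attr | grep osd-db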

So, from what I can understand (which is not much), the issue occurs because LVM doesn't automatically activate volume groups for devices that weren't cleanly shut down. Therefore the symlink points to an inactive LVM volume...

So I tried adding the following setting to /etc/lvm/lvm.conf:
auto_activation_volume_list = [ "*" ]

Then I rebooted node #1 and checked, but it didn't solve anything (I think LVM activates all volumes by default anyway, so that makes sense).
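Something I still want to try as a workaround (untested on my side) is to activate the VGs by hand and let ceph-volume recreate the tmpfs mounts, symlinks and device ownership, instead of destroying the OSDs:

Bash:
# Activate every LVM volume group, then re-run the ceph-volume activation
vgchange -ay
ceph-volume lvm activate --all
# Clear the start-limit counters so systemd will retry the OSD units
systemctl reset-failed 'ceph-osd@*'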

So, the conclusion is that I must be making some rookie mistake, because it doesn't make any sense that after a simple reboot the node is unable to recover its data... and that I have to destroy all of the node's OSDs and rebuild them from scratch. It makes the cluster very fragile and unreliable for production.

I'm asking those of you with more experience: do you know what mistake I'm making, or is this really a bug of some sort?

Thank you for your attention.
 

The node configuration file attached to the original post:

Bash:
root@bo-pmx-vmhost-1:~# pveversion
pve-manager/9.0.3/025864202ebb6109 (running kernel: 6.14.8-2-pve)

Code:
root@bo-pmx-vmhost-1:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface ens6f0 inet manual

auto ens7f0np0
iface ens7f0np0 inet manual
        mtu 9000

auto ens7f1np1
iface ens7f1np1 inet manual
        mtu 9000

auto ens7f2np2
iface ens7f2np2 inet manual
        mtu 9000

auto ens7f3np3
iface ens7f3np3 inet manual
        mtu 9000

iface ens6f1 inet manual

iface ens6f2 inet manual

iface ens6f3 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens7f0np0 ens7f1np1 ens7f2np2 ens7f3np3
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000
#LACP 4 Ports

auto bond0.1000
iface bond0.1000 inet manual
        mtu 9000
#VLAN 1000 - Cluster Corosync

auto bond0.1001
iface bond0.1001 inet manual
        mtu 9000
#VLAN 1001 - Ceph Public

auto bond0.1002
iface bond0.1002 inet manual
        mtu 9000
#VLAN 1002 - Ceph Cluster

auto bond0.90
iface bond0.90 inet manual
        mtu 1500
#VLAN - Datacenter VMs

auto bond0.104
iface bond0.104 inet manual
        mtu 1500
#VLAN Datacenter DMZ

auto vmbr0
iface vmbr0 inet static
        address 172.25.10.21/24
        gateway 172.25.10.1
        bridge-ports ens6f0
        bridge-stp off
        bridge-fd 0
#LAN - Datacenter - Management

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.101/24
        bridge-ports bond0.1000
        bridge-stp off
        bridge-fd 0
        mtu 9000
#LAN - Cluster - Corosync

auto vmbr2
iface vmbr2 inet static
        address 10.10.0.101/24
        bridge-ports bond0.1001
        bridge-stp off
        bridge-fd 0
        mtu 9000
#VLAN 1001 - Ceph Public

auto vmbr3
iface vmbr3 inet static
        address 10.20.0.101/24
        bridge-ports bond0.1002
        bridge-stp off
        bridge-fd 0
        mtu 9000
#VLAN 1002 - Ceph Cluster

auto vmbr4
iface vmbr4 inet manual
        bridge-ports bond0.90
        bridge-stp off
        bridge-fd 0
        mtu 1500
#LAN - Datacenter - VMs

auto vmbr5
iface vmbr5 inet manual
        bridge-ports bond0.104
        bridge-stp off
        bridge-fd 0
        mtu 1500
#LAN - Datacenter - DMZ

source /etc/network/interfaces.d/*

Code:
root@bo-pmx-vmhost-1:~# cat /etc/ceph/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.20.0.0/24
        fsid = d540f3ee-2772-45e3-b2b4-9ff11a075f3f
        mon_allow_pool_delete = true
        mon_host = 10.10.0.101 10.10.0.102 10.10.0.103
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.10.0.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.bo-pmx-vmhost-1]
        public_addr = 10.10.0.101

[mon.bo-pmx-vmhost-2]
        public_addr = 10.10.0.102

[mon.bo-pmx-vmhost-3]
        public_addr = 10.10.0.103

Bash:
root@bo-pmx-vmhost-1:~# ceph -s
  cluster:
    id:     d540f3ee-2772-45e3-b2b4-9ff11a075f3f
    health: HEALTH_WARN
            1 osds down
            1 host (6 osds) down
            Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded, 65 pgs undersized

  services:
    mon: 3 daemons, quorum bo-pmx-vmhost-1,bo-pmx-vmhost-2,bo-pmx-vmhost-3 (age 12h)
    mgr: bo-pmx-vmhost-2(active, since 22h), standbys: bo-pmx-vmhost-3, bo-pmx-vmhost-1
    osd: 18 osds: 12 up (since 22h), 13 in (since 22h)

  data:
    pools:   3 pools, 65 pgs
    objects: 2 objects, 4.1 MiB
    usage:   2.1 TiB used, 21 TiB / 23 TiB avail
    pgs:     2/6 objects degraded (33.333%)
             64 active+undersized
             1  active+undersized+degraded

Bash:
root@bo-pmx-vmhost-1:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         34.58139  root default
-3         11.52713      host bo-pmx-vmhost-1
 0    ssd   1.92119          osd.0               down         0  1.00000
 1    ssd   1.92119          osd.1               down         0  1.00000
 2    ssd   1.92119          osd.2               down         0  1.00000
 3    ssd   1.92119          osd.3               down         0  1.00000
 4    ssd   1.92119          osd.4               down         0  1.00000
 5    ssd   1.92119          osd.5               down   1.00000  1.00000
-5         11.52713      host bo-pmx-vmhost-2
 6    ssd   1.92119          osd.6                 up   1.00000  1.00000
 7    ssd   1.92119          osd.7                 up   1.00000  1.00000
 8    ssd   1.92119          osd.8                 up   1.00000  1.00000
 9    ssd   1.92119          osd.9                 up   1.00000  1.00000
10    ssd   1.92119          osd.10                up   1.00000  1.00000
11    ssd   1.92119          osd.11                up   1.00000  1.00000
-7         11.52713      host bo-pmx-vmhost-3
12    ssd   1.92119          osd.12                up   1.00000  1.00000
13    ssd   1.92119          osd.13                up   1.00000  1.00000
14    ssd   1.92119          osd.14                up   1.00000  1.00000
15    ssd   1.92119          osd.15                up   1.00000  1.00000
16    ssd   1.92119          osd.16                up   1.00000  1.00000
17    ssd   1.92119          osd.17                up   1.00000  1.00000

Bash:
root@bo-pmx-vmhost-1:~# df
Filesystem           1K-blocks    Used Available Use% Mounted on
udev                 263959488       0 263959488   0% /dev
tmpfs                 52800760    2972  52797788   1% /run
/dev/mapper/pve-root  98497780 5483024  87965208   6% /
tmpfs                264003788   61248 263942540   1% /dev/shm
efivarfs                   496     417        74  85% /sys/firmware/efi/efivars
tmpfs                     5120       0      5120   0% /run/lock
tmpfs                     1024       0      1024   0% /run/credentials/systemd-journald.service
tmpfs                264003788       0 264003788   0% /tmp
/dev/sdj2              1046508    8976   1037532   1% /boot/efi
/dev/fuse               131072      56    131016   1% /etc/pve
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-4
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-2
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-1
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-3
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-0
tmpfs                264003788      24 264003764   1% /var/lib/ceph/osd/ceph-5
tmpfs                     1024       0      1024   0% /run/credentials/getty@tty1.service
tmpfs                 52800756       8  52800748   1% /run/user/0

Bash:
root@bo-pmx-vmhost-1:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 20 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 18.18
pool 4 'vm-pool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2162 lfor 0/2162/2160 flags hashpspool stripe_width 0 application rbd read_balance_score 2.24
pool 5 'db-pool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1667 lfor 0/1667/1665 flags hashpspool stripe_width 0 application rbd read_balance_score 3.93

Bash:
root@bo-pmx-vmhost-1:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

Bash:
root@bo-pmx-vmhost-1:~# ceph versions
{
    "mon": {
        "ceph version 19.2.3 (ad1eecf4042e0ce72f382f60c97b709fd6f16a51) squid (stable)": 3
    },
    "mgr": {
        "ceph version 19.2.3 (ad1eecf4042e0ce72f382f60c97b709fd6f16a51) squid (stable)": 3
    },
    "osd": {
        "ceph version 19.2.3 (ad1eecf4042e0ce72f382f60c97b709fd6f16a51) squid (stable)": 12
    },
    "overall": {
        "ceph version 19.2.3 (ad1eecf4042e0ce72f382f60c97b709fd6f16a51) squid (stable)": 18
    }
}

Bash:
root@bo-pmx-vmhost-1:~# ceph crash ls
root@bo-pmx-vmhost-1:~#

Bash:
root@bo-pmx-vmhost-1:~# ceph pg dump osds
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS                       PG_SUM  PRIMARY_PG_SUM
17        179 GiB  1.7 TiB   179 GiB  1.9 TiB    [6,7,8,9,10,11,12,13,14,16]       9               4
16        179 GiB  1.7 TiB   179 GiB  1.9 TiB    [6,7,8,9,10,11,12,13,15,17]      12               3
15        179 GiB  1.7 TiB   179 GiB  1.9 TiB    [6,7,8,9,10,11,12,14,16,17]      14               7
14        179 GiB  1.7 TiB   179 GiB  1.9 TiB    [6,7,8,9,10,11,13,15,16,17]       9               6
13        179 GiB  1.7 TiB   179 GiB  1.9 TiB    [6,7,8,9,10,12,14,15,16,17]       9               4
12        179 GiB  1.7 TiB   179 GiB  1.9 TiB   [7,8,9,10,11,13,14,15,16,17]      12               5
11        179 GiB  1.7 TiB   179 GiB  1.9 TiB   [6,7,8,10,12,13,14,15,16,17]      14               9
10        179 GiB  1.7 TiB   179 GiB  1.9 TiB   [6,7,9,11,12,13,14,15,16,17]       9               5
9         179 GiB  1.7 TiB   179 GiB  1.9 TiB  [6,8,10,11,12,13,14,15,16,17]      11               5
8         179 GiB  1.7 TiB   179 GiB  1.9 TiB  [7,9,10,11,12,13,14,15,16,17]       8               5
7         179 GiB  1.7 TiB   179 GiB  1.9 TiB   [6,8,9,10,12,13,14,15,16,17]      12               6
6         179 GiB  1.7 TiB   179 GiB  1.9 TiB   [7,8,9,10,11,12,13,14,16,17]      11               6
sum       2.1 TiB   21 TiB   2.1 TiB   23 TiB
dumped osds

So...

I've destroyed all of node #1's OSDs and recreated them from scratch.

Ceph accepted it all and soon became healthy again.

It's a bad solution to have to destroy and rebuild all the OSDs each time a node reboots.
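For reference, per OSD that cycle looks roughly like this on the CLI, followed by recreating it against its data disk and the shared DB disk as in the sketch near the top of the thread (the OSD ID is just an example):

Bash:
# Take the dead OSD out and destroy it, cleaning up its LVs
ceph osd out 0
systemctl stop ceph-osd@0.service
pveceph osd destroy 0 --cleanup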

So, after Ceph became healthy again with all disks, I rebooted node #3 and hit the same issue again:
Captura de ecrã 2025-08-09, às 21.17.10.jpg

Checking the logs:

Bash:
root@bo-pmx-vmhost-3:~# systemctl start ceph-osd@12
root@bo-pmx-vmhost-3:~# systemctl status ceph-osd@12.service
● ceph-osd@12.service - Ceph object storage daemon osd.12
     Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: activating (auto-restart) (Result: exit-code) since Sat 2025-08-09 21:14:15 WEST; 6s ago
 Invocation: d16be7a4b19a48709f9a3af4fa9cb38f
    Process: 23531 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 12 (code=exited, status=0/SUCCESS)
    Process: 23536 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 12 --setuser ceph --setgroup ceph (code=exited, status=1/FAILU>
   Main PID: 23536 (code=exited, status=1/FAILURE)
   Mem peak: 41.3M
        CPU: 133ms
root@bo-pmx-vmhost-3:~# journalctl -u ceph-osd@12.service -b -n 200 --no-pager
Aug 09 21:14:11 bo-pmx-vmhost-3 systemd[1]: Starting ceph-osd@12.service - Ceph object storage daemon osd.12...
Aug 09 21:14:11 bo-pmx-vmhost-3 systemd[1]: Started ceph-osd@12.service - Ceph object storage daemon osd.12.
Aug 09 21:14:12 bo-pmx-vmhost-3 ceph-osd[23536]: 2025-08-09T21:14:12.385+0100 752b3fc79680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _minimal_open_bluefs /var/lib/ceph/osd/ceph-12/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 21:14:15 bo-pmx-vmhost-3 ceph-osd[23536]: 2025-08-09T21:14:15.485+0100 752b3fc79680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _minimal_open_bluefs /var/lib/ceph/osd/ceph-12/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 21:14:15 bo-pmx-vmhost-3 ceph-osd[23536]: 2025-08-09T21:14:15.485+0100 752b3fc79680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _open_db failed to prepare db environment:
Aug 09 21:14:15 bo-pmx-vmhost-3 ceph-osd[23536]: 2025-08-09T21:14:15.754+0100 752b3fc79680 -1 osd.12 0 OSD:init: unable to mount object store
Aug 09 21:14:15 bo-pmx-vmhost-3 ceph-osd[23536]: 2025-08-09T21:14:15.754+0100 752b3fc79680 -1  ** ERROR: osd init failed: (5) Input/output error
Aug 09 21:14:15 bo-pmx-vmhost-3 systemd[1]: ceph-osd@12.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 21:14:15 bo-pmx-vmhost-3 systemd[1]: ceph-osd@12.service: Failed with result 'exit-code'.
Aug 09 21:14:25 bo-pmx-vmhost-3 systemd[1]: ceph-osd@12.service: Scheduled restart job, restart counter is at 1.
Aug 09 21:14:25 bo-pmx-vmhost-3 systemd[1]: Starting ceph-osd@12.service - Ceph object storage daemon osd.12...
Aug 09 21:14:25 bo-pmx-vmhost-3 systemd[1]: Started ceph-osd@12.service - Ceph object storage daemon osd.12.
Aug 09 21:14:26 bo-pmx-vmhost-3 ceph-osd[23759]: 2025-08-09T21:14:26.387+0100 77fad184f680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _minimal_open_bluefs /var/lib/ceph/osd/ceph-12/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 21:14:29 bo-pmx-vmhost-3 ceph-osd[23759]: 2025-08-09T21:14:29.707+0100 77fad184f680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _minimal_open_bluefs /var/lib/ceph/osd/ceph-12/block.db symlink exists but target unusable: (13) Permission denied
Aug 09 21:14:29 bo-pmx-vmhost-3 ceph-osd[23759]: 2025-08-09T21:14:29.707+0100 77fad184f680 -1 bluestore(/var/lib/ceph/osd/ceph-12) _open_db failed to prepare db environment:
Aug 09 21:14:29 bo-pmx-vmhost-3 ceph-osd[23759]: 2025-08-09T21:14:29.977+0100 77fad184f680 -1 osd.12 0 OSD:init: unable to mount object store
Aug 09 21:14:29 bo-pmx-vmhost-3 ceph-osd[23759]: 2025-08-09T21:14:29.977+0100 77fad184f680 -1  ** ERROR: osd init failed: (5) Input/output error
Aug 09 21:14:29 bo-pmx-vmhost-3 systemd[1]: ceph-osd@12.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 21:14:29 bo-pmx-vmhost-3 systemd[1]: ceph-osd@12.service: Failed with result 'exit-code'.

So, it's the exact same issue as in the original post with node #1!

We can't reboot a node, or else all of its OSD disks have to be destroyed and rebuilt, which is a very critical issue for a cluster...

I suspect the issue might be:
  • Bug in Proxmox VE 9.0.3
  • Bug in the upgrade process from Proxmox 8 to 9
  • Bug in some mechanism of ceph
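One thing still worth checking on an affected node is whether the ceph-volume activation units actually ran at boot (their names depend on the OSD IDs and fsids, so the glob below is just illustrative):

Bash:
systemctl --all --no-pager list-units 'ceph-volume@*'
journalctl -b -u 'ceph-volume@*' --no-pager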
 
Ok...

So, I reinstalled Proxmox VE 8.4.8 on all 3 nodes, with Ceph 19.2 (the same version as before) and the exact same configuration (network, cluster, names, Ceph, etc.).

And the good news is that the issue doesn't happen in Proxmox VE 8.4.8!!
Each node reboots and Ceph recovers as expected, automatically, without any fuss! The OSD services all come up and sync without issue!

So this is clearly a bug, with two possible causes:
  • Bug in Proxmox VE 9.0.3
  • Bug in the upgrade process from Proxmox 8 to 9
So, for those who are thinking of upgrading a Proxmox cluster with Ceph, I recommend that you test it very well before putting it into production.

Proxmox VE 9 seems to need to mature and be tested more before it goes into production, especially in clusters.

Does anyone know how we can report bugs for Proxmox VE 9?