Issue: Unable to reach Ceph after upgrade to Squid (19.2)

JohnnyMegs

New Member
Apr 15, 2026
Good day. I recently upgraded Ceph from Reef (18.2.8) to Squid (19.2) in preparation for upgrading my Proxmox systems from v8 to v9. I followed this documentation to prepare:
https://pve.proxmox.com/wiki/Upgrade_from_8_to_9#In-place_upgrade

I then completed the steps here:
https://pve.proxmox.com/wiki/Ceph_Reef_to_Squid
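For context, the sequence I ran from that guide was roughly the following (reconstructed from memory, so treat it as a sketch rather than an exact transcript of my terminal):

```shell
# On one node: prevent rebalancing while daemons restart
ceph osd set noout

# On every node: point the Ceph repository at squid instead of reef,
# then pull the new packages
sed -i 's/reef/squid/' /etc/apt/sources.list.d/ceph.list
apt update && apt full-upgrade -y

# On every node, one at a time: restart monitor, manager, then OSDs
systemctl restart ceph-mon.target
systemctl restart ceph-mgr.target
systemctl restart ceph-osd.target

# Once all daemons report 19.2: finalize and re-enable rebalancing
ceph osd require-osd-release squid
ceph osd unset noout
```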

I believe I followed everything properly and, after upgrading Ceph, rebooted each of my nodes one after another. I then left the system as-is for two weeks to make sure I hadn't broken anything. All seemed well and there were no issues. However, yesterday some massive storms were rolling through and I decided to power down my equipment. When I brought it back online, I found that Ceph and the storage pool were unreachable. My nodes (Node1010, Node1011 and Node1012) are themselves fine, with IPv4 in place and the cluster itself healthy.

Issue:
Ceph is unreachable. The Proxmox web GUI displays a '500 Unreachable' error for any Ceph-related page except the configuration page. On the cluster, Ceph reports as '?' with no working monitors or managers. Any 'ceph' command simply hangs in the terminal. The pool backed by Ceph is also non-responsive.
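The hang itself can at least be worked around for diagnostics: the ceph CLI accepts a connection timeout, so a status query fails fast instead of blocking forever when no monitor quorum is reachable. A sketch of what I ran (the timeout values are arbitrary choices of mine):

```shell
# Fail fast instead of hanging when no monitor answers
ceph -s --connect-timeout 10 || echo "no monitor answered within 10s"

# Belt and braces: wrap in coreutils timeout as well
timeout 15 ceph -s || echo "ceph -s did not complete within 15s"
```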

System Specifications:
Nodes: 3 (identical specifications)
Kernel: Linux 6.8.12-20-pve
Manager Version: pve-manager/8.4.18/40eb4ac16f053344
Ceph: Squid (19.2)

Steps I've taken:
I've followed the troubleshooting documentation below, with no change in the issue:
https://github.com/HomeLabHD/Ceph-Disaster_Recovery_in_PVE
https://cr0x.net/en/proxmox-unable-activate-storage-diagnose/

Log/status results are below:
systemctl --type=service --state=running

UNIT LOAD ACTIVE SUB DESCRIPTION
ceph-crash.service loaded active running Ceph crash dump collector
chrony.service loaded active running chrony, an NTP client/server
corosync.service loaded active running Corosync Cluster Engine
cron.service loaded active running Regular background program processing daemon
dbus.service loaded active running D-Bus System Message Bus
dm-event.service loaded active running Device-mapper event daemon
getty@tty1.service loaded active running Getty on tty1
ksmtuned.service loaded active running Kernel Samepage Merging (KSM) Tuning Daemon
lxc-monitord.service loaded active running LXC Container Monitoring Daemon
lxcfs.service loaded active running FUSE filesystem for LXC
postfix@-.service loaded active running Postfix Mail Transport Agent (instance -)
proxmox-firewall.service loaded active running Proxmox nftables firewall
pve-cluster.service loaded active running The Proxmox VE cluster filesystem
pve-firewall.service loaded active running Proxmox VE firewall
pve-ha-crm.service loaded active running PVE Cluster HA Resource Manager Daemon
pve-ha-lrm.service loaded active running PVE Local HA Resource Manager Daemon
pve-lxc-syscalld.service loaded active running Proxmox VE LXC Syscall Daemon
pvedaemon.service loaded active running PVE API Daemon
pvefw-logger.service loaded active running Proxmox VE firewall logger
pveproxy.service loaded active running PVE API Proxy Server
pvescheduler.service loaded active running Proxmox VE scheduler
pvestatd.service loaded active running PVE Status Daemon
qmeventd.service loaded active running PVE Qemu Event Daemon
rpcbind.service loaded active running RPC bind portmap service
rrdcached.service loaded active running LSB: start or stop rrdcached
smartmontools.service loaded active running Self Monitoring and Reporting Technology (SMART) Daemon
spiceproxy.service loaded active running PVE SPICE Proxy Server
ssh.service loaded active running OpenBSD Secure Shell server
systemd-journald.service loaded active running Journal Service
systemd-logind.service loaded active running User Login Management
systemd-udevd.service loaded active running Rule-based Manager for Device Events and Files
user@0.service loaded active running User Manager for UID 0
watchdog-mux.service loaded active running Proxmox VE watchdog multiplexer
zfs-zed.service loaded active running ZFS Event Daemon (zed)

journalctl -xeu ceph-mon@Node1010.service
Apr 15 08:30:29 Node1010 systemd[1]: ceph-mon@Node1010.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit ceph-mon@Node1010.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 15 08:30:29 Node1010 systemd[1]: ceph-mon@Node1010.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit ceph-mon@Node1010.service has entered the 'failed' state with result 'exit-code'.
Apr 15 08:30:38 Node1010 systemd[1]: Stopped ceph-mon@Node1010.service - Ceph cluster monitor daemon.
░░ Subject: A stop job for unit ceph-mon@Node1010.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit ceph-mon@Node1010.service has finished.
░░
░░ The job identifier is 4206 and the job result is done.
Apr 15 08:30:55 Node1010 systemd[1]: ceph-mon@Node1010.service: Start request repeated too quickly.
Apr 15 08:30:55 Node1010 systemd[1]: ceph-mon@Node1010.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit ceph-mon@Node1010.service has entered the 'failed' state with result 'exit-code'.
Apr 15 08:30:55 Node1010 systemd[1]: Failed to start ceph-mon@Node1010.service - Ceph cluster monitor daemon.
░░ Subject: A start job for unit ceph-mon@Node1010.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit ceph-mon@Node1010.service has finished with a failure.
░░
░░ The job identifier is 4207 and the job result is failed.
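Since the journal only shows the generic exit status 1, I also tried running the failing monitor in the foreground to surface the real error. A sketch of the invocation (the --id value matches my node name; yours will differ):

```shell
# Run the monitor in the foreground, logging to stderr, to capture
# the actual reason it exits with status 1
/usr/bin/ceph-mon -f --cluster ceph --id Node1010 \
    --setuser ceph --setgroup ceph 2>&1 | tail -n 40
```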

/etc/pve/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.10.110/24
fsid = 714f7299-923a-4416-a750-cec9016567f2
mon_allow_pool_delete = true
mon_host = 192.168.10.10 192.168.10.11 192.168.10.12
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.10.110/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.Node1010]
public_addr = 192.168.10.10

[mon.Node1011]
public_addr = 192.168.10.11

[mon.Node1012]
public_addr = 192.168.10.12

pvesm status
got timeout
Name Type Status Total Used Available %
Pool01 rbd inactive 0 0 0 0.00%
local dir active 98497780 54073844 39374388 54.90%

vgs
VG #PV #LV #SN Attr VSize VFree
ceph-e4678451-d485-47df-b6f1-cf9b6e7f92e8 1 1 0 wz--n- <1.82t 0
pve 1 3 0 wz--n- <930.51g 16.00g

Thanks in advance for your help. Let me know if there is anything else I can add.
 
The issue was something I didn't expect, and I'm marking this as solved. Two of the three nodes had maxed-out primary HDDs (not the pool) because of a backup job. I didn't notice until now, and didn't expect something else to be affected like this. Honestly, if the mods/admins just delete this thread, since the solution was nowhere near the original symptoms, I'd completely understand.
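For anyone who lands here with the same symptoms: a full root filesystem can stop the monitors from starting, since the mon store lives under /var/lib/ceph on the root disk by default and the monitor refuses to run when free space there drops below its critical threshold. A quick check that would have caught this for me (the 90% threshold is my own arbitrary choice):

```shell
# Warn about any filesystem at or above 90% usage
df -P | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 >= 90) print $6 " is " $5 "% full" }'
```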