No VMs with CEPH Storage will start after update to 8.3.2 and CEPH Squid

Back story: we have a 5-node Proxmox cluster, 3 compute nodes and 2 NAS/Ceph nodes. (Yes, I know this is not optimal; we have another storage and compute node, but we have not finished migrating our old VMware setup off them yet.)

Primary network: 2x 10 Gbps LACP bond on each server, 10.10.104.x/24
Cluster/Ceph network: 2x 25 Gbps balance-xor bond, 172.30.250.x/24

Compute Nodes:
Dell PowerEdge R650
1TB RAM
2TB local-zfs

CEPH Nodes:
Supermicro
512GB RAM
Spinner/Bulk Storage (HDD) 196TB
SSD 96TB
NVME 32TB

The compute nodes and Ceph nodes were updated to PVE 8.3.2 and Ceph Squid. After the reboot, only VMs/CTs that were on local-zfs storage will boot; everything on Ceph storage just sits there and spins, and when I click on the task it just says "No Content". Ceph shows 91% recovery/rebalance and is progressing painfully slowly. Unfortunately we do not have good backups (I know this is stupid, and it is being addressed). Is there any way to recover this? Is there a way to pop the drives out, reinstall the nodes with fresh PVE/Ceph, and recover the OSDs/VMs? Any assistance would be greatly appreciated!
 
ceph health detail:
HEALTH_WARN 1/4 mons down, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1; all OSDs are running squid or later but require_osd_release < squid; 2 daemons have recently crashed; 64 slow ops, oldest one blocked for 395 sec, mon.CHC-000-Prox03 has slow ops
[WRN] MON_DOWN: 1/4 mons down, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1
mon.CHC-000-Prox01 (rank 0) addr [v2:10.10.104.10:3300/0,v1:10.10.104.10:6789/0] is down (out of quorum)
[WRN] OSD_UPGRADE_FINISHED: all OSDs are running squid or later but require_osd_release < squid
all OSDs are running squid or later but require_osd_release < squid
[WRN] RECENT_CRASH: 2 daemons have recently crashed
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T11:45:25.202782Z
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T12:39:08.220237Z
[WRN] SLOW_OPS: 64 slow ops, oldest one blocked for 395 sec, mon.CHC-000-Prox03 has slow ops

ceph -s:
cluster:
id: 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
health: HEALTH_WARN
1/4 mons down, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1
all OSDs are running squid or later but require_osd_release < squid
2 daemons have recently crashed
64 slow ops, oldest one blocked for 420 sec, mon.CHC-000-Prox03 has slow ops

services:
mon: 4 daemons, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1 (age 2d), out of quorum: CHC-000-Prox01
mgr: CHC-000-ProxNAS01(active, since 2d), standbys: pve1, CHC-000-Prox03, CHC-000-Prox04, CHC-000-ProxNAS03
mds: 1/1 daemons up, 1 standby
osd: 28 osds: 28 up (since 2d), 28 in (since 2d); 38 remapped pgs

data:
volumes: 1/1 healthy
pools: 6 pools, 545 pgs
objects: 7.70M objects, 29 TiB
usage: 59 TiB used, 152 TiB / 211 TiB avail
pgs: 1350177/15397010 objects misplaced (8.769%)
504 active+clean
36 active+remapped+backfill_wait
3 active+clean+scrubbing+deep
2 active+remapped+backfilling

io:
recovery: 27 MiB/s, 6 objects/s


/etc/pve/ceph.conf:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.30.250.0/24
fsid = 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
mon_allow_pool_delete = true
mon_host = 172.30.250.2 172.30.250.4 172.30.250.5 172.30.250.12 172.30.250.14
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.104.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.CHC-000-ProxNAS01-01]
host = CHC-000-ProxNAS01
mds_standby_for_name = pve

[mds.CHC-000-ProxNAS03-01]
host = CHC-000-ProxNAS03
mds_standby_for_name = pve

[mon.CHC-000-Prox03]
public_addr = 172.30.250.4

[mon.CHC-000-ProxNAS01-1]
public_addr = 172.30.250.12

[mon.CHC-000-ProxNAS03-1]
public_addr = 172.30.250.14
 
You have connectivity issues. You need to re-establish connectivity to CHC-000-Prox01. Once you have re-established connectivity, kill one monitor; it serves no purpose and just generates/requires pointless IO.

We can continue troubleshooting (if still necessary) once that's done.
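For reference, a minimal sketch of how a monitor is typically removed on a Proxmox/Ceph node; the monitor ID below is a placeholder, so substitute whichever monitor you decide to drop:

Code:
# see which monitors exist and which are in quorum
ceph mon stat

# remove the chosen monitor via the Proxmox wrapper (placeholder ID)
pveceph mon destroy <monid>

# equivalent with plain Ceph tooling
ceph mon remove <monid>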
 
I fixed the issue on CHC-000-Prox01: that node died and was rebuilt, and it is pve1 now... but I still have the same issue.

ceph health detail
HEALTH_WARN all OSDs are running squid or later but require_osd_release < squid; 2 daemons have recently crashed; 75 slow ops, oldest one blocked for 170 sec, mon.CHC-000-Prox03 has slow ops
[WRN] OSD_UPGRADE_FINISHED: all OSDs are running squid or later but require_osd_release < squid
all OSDs are running squid or later but require_osd_release < squid
[WRN] RECENT_CRASH: 2 daemons have recently crashed
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T11:45:25.202782Z
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T12:39:08.220237Z
[WRN] SLOW_OPS: 75 slow ops, oldest one blocked for 170 sec, mon.CHC-000-Prox03 has slow ops


ceph -s
cluster:
id: 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
health: HEALTH_WARN
all OSDs are running squid or later but require_osd_release < squid
2 daemons have recently crashed
75 slow ops, oldest one blocked for 255 sec, mon.CHC-000-Prox03 has slow ops

services:
mon: 5 daemons, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1 (age 4m)
mgr: CHC-000-ProxNAS01(active, since 2d), standbys: pve1, CHC-000-Prox03, CHC-000-Prox04, CHC-000-ProxNAS03
mds: 1/1 daemons up, 1 standby
osd: 28 osds: 28 up (since 2d), 28 in (since 2d); 38 remapped pgs

data:
volumes: 1/1 healthy
pools: 6 pools, 545 pgs
objects: 7.70M objects, 29 TiB
usage: 58 TiB used, 153 TiB / 211 TiB avail
pgs: 1339492/15397010 objects misplaced (8.700%)
503 active+clean
36 active+remapped+backfill_wait
3 active+clean+scrubbing+deep
2 active+remapped+backfilling
1 active+clean+scrubbing

io:
recovery: 20 MiB/s, 5 objects/s


[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.30.250.0/24
fsid = 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
mon_allow_pool_delete = true
mon_host = 172.30.250.2 172.30.250.3 172.30.250.4 172.30.250.5 172.30.250.12 172.30.250.14 10.10.104.13 10.10.104.10
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.104.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.CHC-000-ProxNAS01-01]
host = CHC-000-ProxNAS01
mds_standby_for_name = pve

[mds.CHC-000-ProxNAS03-01]
host = CHC-000-ProxNAS03
mds_standby_for_name = pve

[mon.CHC-000-Prox03]
public_addr = 172.30.250.4

[mon.CHC-000-Prox04]
public_addr = 172.30.250.5

[mon.CHC-000-ProxNAS01-1]
public_addr = 172.30.250.12

[mon.CHC-000-ProxNAS03-1]
public_addr = 172.30.250.14

[mon.pve1]
public_addr = 172.30.250.2
 
done
ceph health detail
HEALTH_WARN 1/5 mons down, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1; all OSDs are running squid or later but require_osd_release < squid; 2 daemons have recently crashed; 40 slow ops, oldest one blocked for 624 sec, daemons [mon.CHC-000-Prox03,mon.CHC-000-ProxNAS03-1] have slow ops.
[WRN] MON_DOWN: 1/5 mons down, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1
mon.CHC-000-Prox03 (rank 0) addr [v2:172.30.250.4:3300/0,v1:172.30.250.4:6789/0] is down (out of quorum)
[WRN] OSD_UPGRADE_FINISHED: all OSDs are running squid or later but require_osd_release < squid
all OSDs are running squid or later but require_osd_release < squid
[WRN] RECENT_CRASH: 2 daemons have recently crashed
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T11:45:25.202782Z
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T12:39:08.220237Z
[WRN] SLOW_OPS: 40 slow ops, oldest one blocked for 624 sec, daemons [mon.CHC-000-Prox03,mon.CHC-000-ProxNAS03-1] have slow ops.


ceph -s
cluster:
id: 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
health: HEALTH_WARN
1/5 mons down, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1
all OSDs are running squid or later but require_osd_release < squid
2 daemons have recently crashed
40 slow ops, oldest one blocked for 624 sec, daemons [mon.CHC-000-Prox03,mon.CHC-000-ProxNAS03-1] have slow ops.

services:
mon: 5 daemons, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1 (age 5m), out of quorum: CHC-000-Prox03
mgr: CHC-000-ProxNAS01(active, since 2d), standbys: pve1, CHC-000-Prox03, CHC-000-Prox04, CHC-000-ProxNAS03
mds: 1/1 daemons up, 1 standby
osd: 28 osds: 28 up (since 2d), 28 in (since 2d); 38 remapped pgs

data:
volumes: 1/1 healthy
pools: 6 pools, 545 pgs
objects: 7.70M objects, 29 TiB
usage: 58 TiB used, 153 TiB / 211 TiB avail
pgs: 1325771/15397010 objects misplaced (8.611%)
502 active+clean
36 active+remapped+backfill_wait
5 active+clean+scrubbing+deep
2 active+remapped+backfilling

io:
recovery: 27 MiB/s, 6 objects/s
 
1/4 mons down, quorum CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1
1/5 mons down, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,CHC-000-Prox04,pve1
I don't know what you meant when you said "done", but it's not leading to that. If you're doing other stuff and would like me to help, I can't very well operate blind to it.

you should end up with 3 monitors, NONE of which is on CHC-000-Prox03, and none reporting down.

In any event, your pools are reporting healthy; if you're still having guest issues, it's time to move the troubleshooting to the Proxmox layer. What do you see for pvesm status?
 
I am also getting "Connection reset by peer (596)" or "communication failure (0)" when I try to navigate to any of the Ceph storage mounts.
I don't know what you meant when you said "done", but it's not leading to that. If you're doing other stuff and would like me to help, I can't very well operate blind to it.

you should end up with 3 monitors, NONE of which is on CHC-000-Prox03, and none reporting down.

In any event, your pools are reporting healthy; if you're still having guest issues, it's time to move the troubleshooting to the Proxmox layer. What do you see for pvesm status?
Sorry, I misunderstood when you said to kill the monitors. I have "destroyed" all monitors except the ones on pve1, CHC-000-NAS01, and CHC-000-NAS03. Below are the pvesm status results and the corresponding journalctl -xe output.


root@pve1:~# pvesm status

mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 221104384 128 221104256 0.00%
local-zfs zfspool active 221800780 696472 221104308 0.31%
old-esxi esxi inactive 0 0 0 0.00%

root@pve1:~# journalctl -xe
Jan 02 12:41:06 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:10 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:12 pve1 kernel: libceph: mon2 (1)172.30.250.14:6789 session established
Jan 02 12:41:12 pve1 kernel: libceph: client219733965 fsid 8f47be00-ff2e-4265-8fc8-91ce4b8d1671


root@CHC-000-Prox03:~# pvesm status
storage 'old-esxi' is not online

mount error: Job failed. See "journalctl -xe" for details.
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 187087872 242688 186845184 0.13%
local-zfs zfspool active 202354716 15509440 186845276 7.66%
old-esxi esxi inactive 0 0 0 0.00%

Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Found left-over process 14487 (mount.ceph) in control group while starting unit. Ignoring.
Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: Mounting mnt-pve-Ceph\x2dFS.mount - /mnt/pve/Ceph-FS...
░░ Subject: A start job for unit mnt-pve-Ceph\x2dFS.mount has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit mnt-pve-Ceph\x2dFS.mount has begun execution.
░░
░░ The job identifier is 679.
Jan 02 12:40:56 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:56 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:57 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:57 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:57 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:58 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:58 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:59 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)


root@CHC-000-Prox04:~# pvesm status

mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 219230976 128 219230848 0.00%
local-zfs zfspool active 219270696 39728 219230968 0.02%
old-esxi esxi inactive 0 0 0 0.00%

░░ The process' exit code is 'killed' and its exit status is 15.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Failed with result 'timeout'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit mnt-pve-Ceph\x2dFS.mount has entered the 'failed' state with result 'timeout'.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Unit process 11952 (mount.ceph) remains running after unit stopped.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: Failed to mount mnt-pve-Ceph\x2dFS.mount - /mnt/pve/Ceph-FS.
░░ Subject: A start job for unit mnt-pve-Ceph\x2dFS.mount has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit mnt-pve-Ceph\x2dFS.mount has finished with a failure.
░░
░░ The job identifier is 711 and the job result is failed.
Jan 02 12:40:59 CHC-000-Prox04 pvestatd[2366]: mount error: Job failed. See "journalctl -xe" for details.
Jan 02 12:40:59 CHC-000-Prox04 corosync[2268]: [KNET ] rx: Packet rejected from 10.10.104.1:5405


root@CHC-000-ProxNAS01:~# pvesm status

mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
PBS: error fetching datastores - 500 Can't connect to 10.10.248.57:8007 (Connection timed out)
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs inactive 0 0 0 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 48890880 943744 47947136 1.93%
local-zfs zfspool active 47948496 96 47948400 0.00%
old-esxi esxi inactive 0 0 0 0.00%

Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18948]: 2025-01-02T12:40:59.828-0700 7c12676006c0 -1 osd.6 44906 heartbeat_check: no reply from 10.10.104.16:6865 osd.26 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6806 osd.11 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6822 osd.12 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6816 osd.13 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6836 osd.14 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6854 osd.15 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6828 osd.22 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6865 osd.26 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6818 osd.3 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6858 osd.20 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6829 osd.21 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6828 osd.22 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6804 osd.23 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6849 osd.24 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6847 osd.25 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6846 osd.27 ever on either front or back, first ping>
Jan 02 12:41:00 CHC-000-ProxNAS01 corosync[3252]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:00 CHC-000-ProxNAS01 kernel: libceph: mds0 (1)10.10.104.16:6861 socket closed (con state V1_BANNER)


root@CHC-000-ProxNAS03:~# pvesm status
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs active 67693301760 29489274880 38204026880 43.56%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 50863488 128 50863360 0.00%
local-zfs zfspool active 50863516 96 50863420 0.00%
old-esxi esxi inactive 0 0 0 0.00%





ceph health detail
HEALTH_WARN all OSDs are running squid or later but require_osd_release < squid; 2 daemons have recently crashed; 76 slow ops, oldest one blocked for 150 sec, mon.CHC-000-ProxNAS03-1 has slow ops
[WRN] OSD_UPGRADE_FINISHED: all OSDs are running squid or later but require_osd_release < squid
all OSDs are running squid or later but require_osd_release < squid
[WRN] RECENT_CRASH: 2 daemons have recently crashed
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T11:45:25.202782Z
mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T12:39:08.220237Z
[WRN] SLOW_OPS: 76 slow ops, oldest one blocked for 150 sec, mon.CHC-000-ProxNAS03-1 has slow ops

ceph -s
cluster:
id: 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
health: HEALTH_WARN
all OSDs are running squid or later but require_osd_release < squid
2 daemons have recently crashed
76 slow ops, oldest one blocked for 200 sec, mon.CHC-000-ProxNAS03-1 has slow ops

services:
mon: 3 daemons, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,pve1 (age 2m)
mgr: pve1(active, since 14m), standbys: CHC-000-ProxNAS03, CHC-000-ProxNAS01
mds: 1/1 daemons up, 2 standby
osd: 28 osds: 28 up (since 24m), 28 in (since 2d); 37 remapped pgs

data:
volumes: 1/1 healthy
pools: 6 pools, 545 pgs
objects: 7.70M objects, 29 TiB
usage: 59 TiB used, 152 TiB / 211 TiB avail
pgs: 1295708/15397010 objects misplaced (8.415%)
505 active+clean
35 active+remapped+backfill_wait
2 active+clean+scrubbing+deep
2 active+remapped+backfilling
1 active+clean+scrubbing

io:
recovery: 12 MiB/s, 3 objects/s
 
One thing I am seeing: even though /etc/pve/ceph.conf shows the cluster-network IP 172.30.250.2, both the monitor and the manager on pve1 show the 10.10.104.10 IP instead. Both IPs are routable, so I am not sure whether this is an issue, but I figured I would point it out.
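As a read-only way to see what the monitors are actually advertising (monitor addresses live in the monmap, so they can differ from what ceph.conf currently says), something like this helps; the daemon name is just an example:

Code:
# the monitor map, with the address each mon actually advertises
ceph mon dump

# the running config of one monitor, including the networks it resolved
ceph config show mon.pve1 | grep -E 'public_network|cluster_network|public_addr'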
 
root@pve1:~# pvecm status
Cluster information
-------------------
Name: CHC-000-ProxMox
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 12:54:09 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000001
Ring ID: 1.16d0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.104.10 (local)
0x00000002 1 10.10.104.12
0x00000003 1 10.10.104.16
0x00000004 1 10.10.104.14
0x00000005 1 10.10.104.13


root@CHC-000-Prox03:~# pvecm status
Cluster information
-------------------
Name: CHC-000-ProxMox
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 12:54:09 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000002
Ring ID: 1.16d0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.104.10
0x00000002 1 10.10.104.12 (local)
0x00000003 1 10.10.104.16
0x00000004 1 10.10.104.14
0x00000005 1 10.10.104.13



root@CHC-000-Prox04:~# pvecm status
Cluster information
-------------------
Name: CHC-000-ProxMox
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 12:54:09 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 1.16d0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.104.10
0x00000002 1 10.10.104.12
0x00000003 1 10.10.104.16
0x00000004 1 10.10.104.14
0x00000005 1 10.10.104.13 (local)



root@CHC-000-ProxNAS01:~# pvecm status
Cluster information
-------------------
Name: CHC-000-ProxMox
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 12:54:09 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000004
Ring ID: 1.16d0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.104.10
0x00000002 1 10.10.104.12
0x00000003 1 10.10.104.16
0x00000004 1 10.10.104.14 (local)
0x00000005 1 10.10.104.13


root@CHC-000-ProxNAS03:~# pvecm status
Cluster information
-------------------
Name: CHC-000-ProxMox
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 12:54:09 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000003
Ring ID: 1.16d0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.104.10
0x00000002 1 10.10.104.12
0x00000003 1 10.10.104.16 (local)
0x00000004 1 10.10.104.14
0x00000005 1 10.10.104.13
 
Jan 02 12:41:06 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:10 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:12 pve1 kernel: libceph: mon2 (1)172.30.250.14:6789 session established
Jan 02 12:41:12 pve1 kernel: libceph: client219733965 fsid 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
You're still having connection issues, either due to a network misconfiguration or physical-layer problems. BTW, it looks like you're commingling all your traffic on the 10.10.104.0 subnet. This is a bad idea and can/will cause you pain. To be sure, please post the content of a node's /etc/network/interfaces file AND MAKE SURE THE OTHERS ARE SET UP IDENTICALLY.
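If it is easier than posting attachments, here is a quick sketch for pulling all five configs in one go (node names taken from the pvecm output above; it assumes root SSH between the nodes works, and the addresses will of course differ per node while the structure should match):

Code:
for n in pve1 CHC-000-Prox03 CHC-000-Prox04 CHC-000-ProxNAS01 CHC-000-ProxNAS03; do
  echo "===== $n ====="
  ssh root@$n cat /etc/network/interfaces
done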
 
I am not sure what you mean by commingling traffic or how to resolve this. The 10.10.104.x traffic is all physically going to one set of network hardware, and the 172.30.250.x traffic is all physically going to another set of network hardware.

Our overall network infrastructure is running on Mikrotik hardware and we are running OSPF.

Attached are the /etc/network/interfaces files for each of the servers.
 


Here is what I would do differently:

1. Make the Ceph public interface the same as the Ceph private one.
2. IF you have switch ports available, separate corosync onto its own interface(s); two links are preferable to one, and if you have two you don't need any bonding for them. In any case, do NOT use vmbr1 for this traffic. If you want to reuse the same physical interfaces, either give bond0 its own address/VLAN for the purpose (and NOT on the same subnet), or just create 2 VLAN subinterfaces on the bond's child NICs, e.g.:

Code:
auto enp1s0f2np2.50
iface enp1s0f2np2.50 inet static
# Corosync Ring 1
  address 10.10.50.16/24

auto enp1s0f3np3.51
iface enp1s0f3np3.51 inet static
# Corosync Ring 2
  address 10.10.51.16/24

Yes, you can have subinterfaces pass traffic separately from a parent bond. A sketch of the matching corosync.conf entries follows below.
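Those ring addresses then get referenced in /etc/pve/corosync.conf, roughly like this; a sketch only, reusing the example addresses above for the node at 10.10.104.16, and remember that PVE expects edits to go into /etc/pve/corosync.conf with config_version bumped:

Code:
nodelist {
  node {
    name: CHC-000-ProxNAS03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.50.16
    ring1_addr: 10.10.51.16
  }
  # ...one node entry per host, each with its own ring0/ring1 addresses;
  # the totem section also gets an interface block with the matching linknumber
}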

Having said all that, making these changes in a live environment is possible but pretty fraught. This might be more a basis for a new cluster deployment.

In the meantime, figure out why you have traffic rejected for 172.30.250.2 (whichever node that is).
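A few read-only checks that might narrow it down; purely illustrative, with the address and ports taken from the logs above (3300/6789 are the Ceph monitor ports, 5405 is corosync):

Code:
# can this node reach the monitor that libceph keeps failing against?
ping -c 3 172.30.250.2
nc -zv 172.30.250.2 3300    # requires netcat to be installed
nc -zv 172.30.250.2 6789

# corosync's own view of its knet links
corosync-cfgtool -s

# which device is 10.10.104.1, the source of the rejected corosync packets?
ip neigh show to 10.10.104.1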
 
Can I pop out the drives on my Ceph cluster, reinstall Proxmox on all nodes, resolve these network issues, then add the drives/OSDs back and restore my VMs? Or am I going to have to rebuild everything from scratch? Unfortunately, as I stated earlier, we have no backups, and I am addressing that internally.
 
Can I pop out the drives on my Ceph cluster, reinstall Proxmox on all nodes, resolve these network issues, then add the drives/OSDs back and restore my VMs?
It would be easier and safer to rebuild from scratch and restore from PBS.

Edit:
I see you have no backups.

Here is the good news: all your pools appear accessible and functioning. If you're not able to access them from PVE, you can use rbd to back up the images. You can do it yourself from the CLI, or use something like https://github.com/camptocamp/ceph-rbd-backup

You can then restore those backups to your new pool.
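For the CLI route, a minimal sketch; the pool, image, and scratch path below are placeholders, so list what you actually have first and export to whatever storage you trust:

Code:
# list the images in a pool (pool name is an example)
rbd ls NVME

# export one image to a flat file on scratch storage
rbd export NVME/vm-101-disk-0 /mnt/scratch/vm-101-disk-0.raw

# later, on the rebuilt cluster, import it into the new pool
rbd import /mnt/scratch/vm-101-disk-0.raw NVME/vm-101-disk-0

If full exports are too large to stage, rbd also has snapshot-based export-diff/import-diff for incremental copies.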
 