I am also getting "Connection reset by peer (596)" or "communication failure (0)" errors when I try to navigate to any of the Ceph storage mounts.
I don't know what you meant when you said "done", but it's not leading to that. If you're doing other stuff and would like me to help, I can't very well operate blind to it.
you should end up with 3 monitors, NONE of which is on CHC-000-Prox03, and none reporting down.
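If you want to double-check that from the CLI, something like the following should do it (just a sketch; run it from any node that still has quorum, and <monid> is a placeholder for whatever ID a leftover monitor shows up under):

ceph mon stat                        # should report exactly 3 mons, all in quorum
ceph mon dump | grep 'mon\.'         # shows which host/IP each monitor lives on
pveceph mon destroy <monid>          # only if a stray monitor still appears on CHC-000-Prox03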
In any event, your pools are reporting healthy; if you're still having guest issues, it's time to move the troubleshooting to the Proxmox layer. What do you see for pvesm status?
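For reference, this is roughly what I'd run on each node (a minimal sketch; the unit glob assumes the failing mounts are the usual PVE-managed ones under /mnt/pve/):

pvesm status                          # per-node view of every configured storage
journalctl -xe -u 'mnt-pve-*.mount'   # details for any PVE mount units that failed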
Sorry, I misunderstood when you said to kill the monitors. I have "Destroyed" all monitors except the ones on pve1, CHC-000-ProxNAS01, and CHC-000-ProxNAS03. Below are the pvesm status results and the corresponding journalctl -xe output from each node.
root@pve1:~# pvesm status
mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 221104384 128 221104256 0.00%
local-zfs zfspool active 221800780 696472 221104308 0.31%
old-esxi esxi inactive 0 0 0 0.00%
root@pve1:~# journalctl -xe
Jan 02 12:41:06 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:06 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:07 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 kernel: libceph: mon3 (1)172.30.250.2:6789 socket error on write
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:08 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:09 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:09 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:10 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:11 pve1 corosync[2799]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:12 pve1 kernel: libceph: mon2 (1)172.30.250.14:6789 session established
Jan 02 12:41:12 pve1 kernel: libceph: client219733965 fsid 8f47be00-ff2e-4265-8fc8-91ce4b8d1671
root@CHC-000-Prox03:~# pvesm status
storage 'old-esxi' is not online
mount error: Job failed. See "journalctl -xe" for details.
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 187087872 242688 186845184 0.13%
local-zfs zfspool active 202354716 15509440 186845276 7.66%
old-esxi esxi inactive 0 0 0 0.00%
Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Found left-over process 14487 (mount.ceph) in control group while starting unit. Ignoring.
Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 02 12:40:56 CHC-000-Prox03 systemd[1]: Mounting mnt-pve-Ceph\x2dFS.mount - /mnt/pve/Ceph-FS...
░░ Subject: A start job for unit mnt-pve-Ceph\x2dFS.mount has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit mnt-pve-Ceph\x2dFS.mount has begun execution.
░░
░░ The job identifier is 679.
Jan 02 12:40:56 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:56 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:57 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:57 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:57 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
Jan 02 12:40:58 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:58 CHC-000-Prox03 corosync[2503]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:40:59 CHC-000-Prox03 kernel: libceph: mon5 (1)172.30.250.5:6789 socket closed (con state V1_BANNER)
root@CHC-000-Prox04:~# pvesm status
mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 219230976 128 219230848 0.00%
local-zfs zfspool active 219270696 39728 219230968 0.02%
old-esxi esxi inactive 0 0 0 0.00%
░░ The process' exit code is 'killed' and its exit status is 15.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Failed with result 'timeout'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit mnt-pve-Ceph\x2dFS.mount has entered the 'failed' state with result 'timeout'.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: mnt-pve-Ceph\x2dFS.mount: Unit process 11952 (mount.ceph) remains running after unit stopped.
Jan 02 12:40:59 CHC-000-Prox04 systemd[1]: Failed to mount mnt-pve-Ceph\x2dFS.mount - /mnt/pve/Ceph-FS.
░░ Subject: A start job for unit mnt-pve-Ceph\x2dFS.mount has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit mnt-pve-Ceph\x2dFS.mount has finished with a failure.
░░
░░ The job identifier is 711 and the job result is failed.
Jan 02 12:40:59 CHC-000-Prox04 pvestatd[2366]: mount error: Job failed. See "journalctl -xe" for details.
Jan 02 12:40:59 CHC-000-Prox04 corosync[2268]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
root@CHC-000-ProxNAS01:~# pvesm status
mount error: Job failed. See "journalctl -xe" for details.
storage 'old-esxi' is not online
PBS: error fetching datastores - 500 Can't connect to 10.10.248.57:8007 (Connection timed out)
Name Type Status Total Used Available %
Ceph-FS cephfs inactive 0 0 0 0.00%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs inactive 0 0 0 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 48890880 943744 47947136 1.93%
local-zfs zfspool active 47948496 96 47948400 0.00%
old-esxi esxi inactive 0 0 0 0.00%
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18948]: 2025-01-02T12:40:59.828-0700 7c12676006c0 -1 osd.6 44906 heartbeat_check: no reply from 10.10.104.16:6865 osd.26 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6806 osd.11 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6822 osd.12 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6816 osd.13 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6836 osd.14 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6854 osd.15 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6828 osd.22 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18958]: 2025-01-02T12:40:59.837-0700 78b4534006c0 -1 osd.9 44906 heartbeat_check: no reply from 10.10.104.16:6865 osd.26 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6818 osd.3 ever on either front or back, first ping >
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6858 osd.20 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6829 osd.21 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6828 osd.22 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6804 osd.23 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6849 osd.24 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6847 osd.25 ever on either front or back, first ping>
Jan 02 12:40:59 CHC-000-ProxNAS01 ceph-osd[18935]: 2025-01-02T12:40:59.889-0700 79c35d8006c0 -1 osd.17 44906 heartbeat_check: no reply from 10.10.104.16:6846 osd.27 ever on either front or back, first ping>
Jan 02 12:41:00 CHC-000-ProxNAS01 corosync[3252]: [KNET ] rx: Packet rejected from 10.10.104.1:5405
Jan 02 12:41:00 CHC-000-ProxNAS01 kernel: libceph: mds0 (1)10.10.104.16:6861 socket closed (con state V1_BANNER)
root@CHC-000-ProxNAS03:~# pvesm status
storage 'old-esxi' is not online
Name Type Status Total Used Available %
Ceph-FS cephfs active 67693301760 29489274880 38204026880 43.56%
NVME rbd active 4881687008 1305061600 3576625408 26.73%
PBS pbs active 64273514112 24960 64273489152 0.00%
SSD rbd active 20373375548 129780284 20243595264 0.64%
Spinner rbd active 38406609956 202583076 38204026880 0.53%
esxi esxi active 0 0 0 0.00%
local dir active 50863488 128 50863360 0.00%
local-zfs zfspool active 50863516 96 50863420 0.00%
old-esxi esxi inactive 0 0 0 0.00%
ceph health detail
HEALTH_WARN all OSDs are running squid or later but require_osd_release < squid; 2 daemons have recently crashed; 76 slow ops, oldest one blocked for 150 sec, mon.CHC-000-ProxNAS03-1 has slow ops
[WRN] OSD_UPGRADE_FINISHED: all OSDs are running squid or later but require_osd_release < squid
    all OSDs are running squid or later but require_osd_release < squid
[WRN] RECENT_CRASH: 2 daemons have recently crashed
    mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T11:45:25.202782Z
    mds.CHC-000-ProxNAS01-01 crashed on host CHC-000-ProxNAS01 at 2024-12-31T12:39:08.220237Z
[WRN] SLOW_OPS: 76 slow ops, oldest one blocked for 150 sec, mon.CHC-000-ProxNAS03-1 has slow ops
ceph -s
  cluster:
    id:     8f47be00-ff2e-4265-8fc8-91ce4b8d1671
    health: HEALTH_WARN
            all OSDs are running squid or later but require_osd_release < squid
            2 daemons have recently crashed
            76 slow ops, oldest one blocked for 200 sec, mon.CHC-000-ProxNAS03-1 has slow ops

  services:
    mon: 3 daemons, quorum CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1,pve1 (age 2m)
    mgr: pve1(active, since 14m), standbys: CHC-000-ProxNAS03, CHC-000-ProxNAS01
    mds: 1/1 daemons up, 2 standby
    osd: 28 osds: 28 up (since 24m), 28 in (since 2d); 37 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 545 pgs
    objects: 7.70M objects, 29 TiB
    usage:   59 TiB used, 152 TiB / 211 TiB avail
    pgs:     1295708/15397010 objects misplaced (8.415%)
             505 active+clean
             35  active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             2   active+remapped+backfilling
             1   active+clean+scrubbing

  io:
    recovery: 12 MiB/s, 3 objects/s
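One note on the health output above (just a sketch of the usual cleanup; it assumes every OSD really is on squid, which is what the warning itself says, and the slow-ops warning should drain on its own once the 37 remapped PGs finish backfilling):

ceph osd require-osd-release squid   # clears the OSD_UPGRADE_FINISHED warning after the upgrade
ceph crash archive-all               # acknowledges the two old mds crash reports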