Upgraded to this:
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
All nodes are on the same version, and this is a fully licensed cluster.
On boot, a node seems to cycle its Ceph OSDs in and out of service, the monitor/manager on one node is down and I can't re-add it, and we can't keep the CephFS mounts up, so we can't start any VMs.
Syslog seems to report that the OSD heartbeats are trying to use the Ceph 'public' network - is that right? Shouldn't they use the Ceph cluster network? The public-network heartbeats are failing, and I'm not sure why, so I'll try to figure that out - but I don't want this traffic on the public network.
So what to do? I've been cruising the forums trying to fix this, but I'm really scratching my head.
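For reference, here's roughly how I've been checking which addresses an OSD actually registered for heartbeats (just a sketch, assuming the usual Quincy "ceph osd metadata" fields and "ceph config show" syntax; osd.36 is the one from the log further down):
Code:
# Front/back and heartbeat addresses that osd.36 registered with the cluster
ceph osd metadata 36 | grep -E '"(front|back|hb_front|hb_back)_addr"'

# The public/cluster networks the running OSD actually picked up
ceph config show osd.36 public_network
ceph config show osd.36 cluster_network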
Here's my ceph config:
Code:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.128.16.0/24
fsid = 2c88d85e-8a28-4cdc-800e-1979903a8d09
mon_allow_pool_delete = true
mon_host = 10.128.16.11 10.128.16.12 10.128.16.10
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.128.18.0/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.VAN3PM1]
host = VAN3PM1
mds standby for name = pve
[mds.VAN3PM2]
host = VAN3PM2
mds_standby_for_name = pve
[mds.VAN3PM3]
host = VAN3PM3
mds_standby_for_name = pve
[mon.VAN3PM1]
public_addr = 10.128.16.10
[mon.VAN3PM2]
public_addr = 10.128.16.11
[mon.VAN3PM3]
public_addr = 10.128.16.12
And you can see that the OSDs are trying to send heartbeats over the public network:
Code:
Aug 10 14:36:46 VAN3PM1 ceph-osd[38621]: 2023-08-10T14:36:46.066-0700 7fdf06b8c700 -1 osd.36 8808 heartbeat_check: no reply from 10.128.18.12:6810 osd.34 ever on either front or back, first ping sent 2023-08-10T14:33:17.768929-0700 (oldest deadline 2023-08-10T14:33:37.768929-0700)
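In case it helps, this is the kind of basic reachability test I've been running from VAN3PM1 against the address in that log line (nothing fancy - plain ping and netcat, which may need to be installed):
Code:
# Can VAN3PM1 reach osd.34's heartbeat address on the public network at all?
ping -c 3 10.128.18.12

# Is the OSD heartbeat port reachable, or is something filtering it?
nc -vz 10.128.18.12 6810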
Help!