Unknown status hosts .. unstable ceph .. help!

leex12 · May 25, 2022

I have a 5-node cluster which has been working fine but seems to have gone crazy a day or two after I did an update (pve-manager/7.2-4/ca9d43cc (running kernel: 5.15.35-1-pve)).

After a while, three of the hosts (always the same three) go grey but are still up. If I run the following commands, they come back for a while:
systemctl restart ceph.target
service pve-cluster restart
service pveproxy restart
service pvedaemon restart
service pvestatd restart

However, two of the three are monitors and managers for Ceph. Whilst they show a status of running, the host is 'unknown'.

Looking at journalctl -xe, I see a bunch of Ceph heartbeat errors, e.g. "osd.2 13485 heartbeat_check: no reply from 192.168.107.55", which is really odd as that IP has nothing to do with the servers.

aaron · May 25, 2022

Can you post your /etc/pve/ceph.conf contents?

Do you have the firewall enabled?

Can the nodes ping each other on the IP addresses used for Ceph?

leex12 · May 25, 2022

Hi,

I haven't enabled the Proxmox firewall on any of the hosts. Interestingly, all the hosts can ping each other on the public address (which is the backup network for Ceph), but one node can't be reached on the primary Ceph link (10.107.x.x).

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.107.0.4/16
fsid = 1e8245d2-c907-490c-98e0-cddf1c2dea80
mon_allow_pool_delete = true
mon_host = 192.168.107.8 192.168.107.2 192.168.107.1
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.107.4/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve01]
host = pve01
mds_standby_for_name = pve

[mds.pve02]
host = pve02
mds_standby_for_name = pve

[mds.pve08]
host = pve08
mds_standby_for_name = pve

[mon.pve01]
public_addr = 192.168.107.1

[mon.pve02]
public_addr = 192.168.107.2

[mon.pve08]
public_addr = 192.168.107.8
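
As a quick sanity check on the values in this ceph.conf, the following is a minimal Python sketch (not something posted in the thread) that uses the standard `ipaddress` module to see which configured network the address from the heartbeat error falls into. The helper name `classify` is made up for illustration. Note that both networks are written with host bits set (`.4`), so the sketch parses them with `strict=False`.

```python
import ipaddress
import re

# Values copied from the [global] section posted above. strict=False is needed
# because both are written with host bits set (".4"), not as a bare network.
public_network = ipaddress.ip_network("192.168.107.4/24", strict=False)
cluster_network = ipaddress.ip_network("10.107.0.4/16", strict=False)


def classify(addr):
    """Say which of the two configured Ceph networks an IPv4 address is in."""
    ip = ipaddress.ip_address(addr)
    if ip in cluster_network:
        return "cluster"
    if ip in public_network:
        return "public"
    return "neither"


# Journal fragment quoted in the first post.
log_line = "osd.2 13485 heartbeat_check: no reply from 192.168.107.55"
peer = re.search(r"no reply from (\d+(?:\.\d+){3})", log_line).group(1)

print(public_network, cluster_network)
print(peer, "->", classify(peer))

# The monitor addresses from mon_host should all sit in the public network.
for mon in "192.168.107.8 192.168.107.2 192.168.107.1".split():
    print(mon, "->", classify(mon))
```

The unknown 192.168.107.55 lies inside the configured public network range, so one possible reading is that some interface on a node has picked up that address and a Ceph daemon is using it. The sketch only shows that the address is in range; it does not say which host owns it.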

aaron · May 25, 2022

leex12 said:
but one node can't be reached on the primary ceph link (10.107.x.x)

Then check why that is. The optional ceph cluster network (the public network is the mandatory "main" ceph network) is used for the traffic between the OSDs and can take away quite some load from the public network.

Is the network config correct on that node? Compare the ip a output with the configuration in /etc/network/interfaces. One possibility is that the enumeration of the NICs changed with the new kernel due to a changed driver. In that case, you need to adapt the network config. If I remember correctly, there was something with the Mellanox drivers.
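
The comparison suggested here can also be done mechanically. Below is a minimal Python sketch, assuming the interfaces file uses `address a.b.c.d/prefix` lines (a separate `netmask` line is not handled) and that the live state comes from the JSON output of `ip -j addr show`. The function names are hypothetical. The sample data is hand-written for illustration, using the vmbr1 values that appear later in this thread. The `.1` in the sample interfaces text is an assumption, since only server2's file was posted.

```python
import json


def configured_addresses(interfaces_text):
    """Map interface name -> set of 'address' values from ifupdown stanzas."""
    result = {}
    current = None
    for line in interfaces_text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "iface" and len(words) >= 2:
            current = words[1]
            result.setdefault(current, set())
        elif words[0] == "address" and current is not None and len(words) >= 2:
            result[current].add(words[1])
    return result


def live_addresses(ip_json_text):
    """Map interface name -> set of IPv4 'addr/prefix' from `ip -j addr show`."""
    result = {}
    for link in json.loads(ip_json_text):
        result[link["ifname"]] = {
            f"{a['local']}/{a['prefixlen']}"
            for a in link.get("addr_info", [])
            if a.get("family") == "inet"
        }
    return result


def diff_addresses(configured, live):
    """Return (missing, unexpected) IPv4 addresses per interface."""
    missing, unexpected = {}, {}
    for ifname in sorted(set(configured) | set(live)):
        if ifname == "lo":
            continue  # loopback has no "address" line in the interfaces file
        want = configured.get(ifname, set())
        have = live.get(ifname, set())
        if want - have:
            missing[ifname] = want - have
        if have - want:
            unexpected[ifname] = have - want
    return missing, unexpected


# Illustrative sample only: the .1 address is assumed, see the note above.
SAMPLE_INTERFACES = """
iface enp4s0f1 inet manual

auto vmbr1
iface vmbr1 inet static
    address 10.107.0.1/16
    bridge-ports enp4s0f1
"""

SAMPLE_IP_JSON = json.dumps([
    {"ifname": "enp4s0f1", "addr_info": [
        {"family": "inet", "local": "192.168.107.55", "prefixlen": 24},
    ]},
    {"ifname": "vmbr1", "addr_info": [
        {"family": "inet", "local": "10.107.0.1", "prefixlen": 16},
        {"family": "inet", "local": "192.168.107.55", "prefixlen": 24},
    ]},
])

missing, unexpected = diff_addresses(
    configured_addresses(SAMPLE_INTERFACES), live_addresses(SAMPLE_IP_JSON)
)
print("missing:", missing)
print("unexpected:", unexpected)
```

On a real node, the two inputs would be the contents of /etc/network/interfaces and the output of `ip -j addr show`. An entry under "missing" points at a configured address that never came up, and an entry under "unexpected" points at an address that something other than the interfaces file put there.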

leex12 · May 25, 2022

More head scratching

.. so I have got the Ceph network back on server1 and can now ping all the other 10.107 hosts.

However, some weird stuff: the physical onboard Ethernet port, which isn't used, had the 192.168.107.55 address that was mentioned in the heartbeat failure.

server1 and server2 are identical machines. Their interfaces files were identical, but the ip a output was not.

This is the interfaces file from server2:
auto lo
iface lo inet loopback

iface enp4s0f0 inet manual

iface enp0s31f6 inet manual

iface enp4s0f1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.107.2/24
    gateway 192.168.107.254
    bridge-ports enp4s0f0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr1
iface vmbr1 inet static
    address 10.107.0.2/16
    bridge-ports enp4s0f1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

Output from ip a on server2:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 6c:2b:59:cd:5d:b8 brd ff:ff:ff:ff:ff:ff
3: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 2c:27:d7:4f:8f:f0 brd ff:ff:ff:ff:ff:ff
4: enp4s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 2c:27:d7:4f:8f:f4 brd ff:ff:ff:ff:ff:ff
5: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 2c:27:d7:4f:8f:f0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.2/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::2e27:d7ff:fe4f:8ff0/64 scope link
       valid_lft forever preferred_lft forever
6: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 2c:27:d7:4f:8f:f4 brd ff:ff:ff:ff:ff:ff
    inet 10.107.0.2/16 scope global vmbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::2e27:d7ff:fe4f:8ff4/64 scope link
       valid_lft forever preferred_lft forever

This is what I would expect. The ip a output on server1, however, looks like this:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether d8:9e:f3:3b:97:9a brd ff:ff:ff:ff:ff:ff
3: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:70 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.44/24 brd 192.168.107.255 scope global noprefixroute enp4s0f0
       valid_lft forever preferred_lft forever
4: enp4s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:74 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.55/24 brd 192.168.107.255 scope global noprefixroute enp4s0f1
       valid_lft forever preferred_lft forever
5: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:70 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.1/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet 192.168.107.44/24 brd 192.168.107.255 scope global secondary noprefixroute vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::1c99:1e0f:68d5:62cd/64 scope link
       valid_lft forever preferred_lft forever
6: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:74 brd ff:ff:ff:ff:ff:ff
    inet 10.107.0.1/16 scope global vmbr1
       valid_lft forever preferred_lft forever
    inet 192.168.107.55/24 brd 192.168.107.255 scope global noprefixroute vmbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::e64f:3af4:b429:26cb/64 scope link
       valid_lft forever preferred_lft forever

I have no idea why the additional IP addresses have attached themselves to the NICs and bridges.

Don't forget I have two other 'ghost' servers, which are communicating fine.
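
One way to spot this kind of stray address on the other nodes is to look for IPv4 addresses on interfaces that are enslaved to a bridge. The interfaces file declares the ports as `inet manual`, so none are expected there. The following is a minimal Python sketch under that assumption, with a made-up function name. It parses plain `ip a` text and is fed an excerpt of the server1 output above.

```python
import re


def addresses_on_bridge_ports(ip_a_text):
    """Return {ifname: [IPv4 addr/prefix, ...]} for interfaces with a master."""
    findings = {}
    current = None    # interface whose block is being parsed
    is_port = False   # True when that interface has a "master <bridge>" field
    for line in ip_a_text.splitlines():
        header = re.match(r"^\d+:\s+([^:@\s]+)[^:]*:\s+<[^>]*>(.*)$", line)
        if header:
            current = header.group(1)
            is_port = "master" in header.group(2).split()
            continue
        # "inet6" lines do not match because \s+ must follow "inet" directly.
        inet = re.match(r"^\s*inet\s+(\S+)", line)
        if inet and is_port:
            findings.setdefault(current, []).append(inet.group(1))
    return findings


# Excerpt of the server1 output posted above (interfaces 3 to 5).
SERVER1_EXCERPT = """\
3: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:70 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.44/24 brd 192.168.107.255 scope global noprefixroute enp4s0f0
       valid_lft forever preferred_lft forever
4: enp4s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:74 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.55/24 brd 192.168.107.255 scope global noprefixroute enp4s0f1
       valid_lft forever preferred_lft forever
5: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c4:34:6b:cc:7b:70 brd ff:ff:ff:ff:ff:ff
    inet 192.168.107.1/24 scope global vmbr0
       valid_lft forever preferred_lft forever
"""

print(addresses_on_bridge_ports(SERVER1_EXCERPT))
```

For this excerpt the sketch reports 192.168.107.44/24 on enp4s0f0 and 192.168.107.55/24 on enp4s0f1, while the address on vmbr0 is ignored because a bridge itself is allowed to carry one. Running the same check against the server2 output would report nothing, since its bridge ports carry no inet lines.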

aaron · May 25, 2022

So the cluster and ceph are working fine so far?

leex12 said:
Their interfaces files were identical, but the ip a output was not.

I hope the last octet in the IP addresses is different.

Are you talking about the 192.168.107.44 and 192.168.107.55 addresses on server1?

Do they show up if you reboot the node? What happens if you run ifreload -a?

Also, please post CLI output inside [code][/code] tags for better readability

leex12 · May 25, 2022

I think the network issue is sorted on server1! I don't think it was the correct way to get rid of the additional network addresses, but disabling the DHCP service ( systemctl disable dhcpcd.service ) stops it picking up an address, and I can now ping on both the public and Ceph networks.

After a brief moment of green, server1 has now gone back to grey!

However, the monitors/managers are now looking good. The status of the host is showing correctly! The other two servers that were grey are now back to green! I didn't notice whether this happened when I switched off the DHCP service.

aaron · May 25, 2022

leex12 said:
but disabling the DHCP service ( systemctl disable dhcpcd.service ) stops it picking up an address

What else is installed? There should not be any service doing DHCP requests on a Proxmox VE node under normal circumstances.

leex12 · May 25, 2022

Well, that makes zip sense to me .. server1 has gone green! All OSDs are up and in!

aaron said:
What else is installed? There should not be any service doing DHCP requests

Nothing! I have Docker on there and a couple of desktop VMs. server1 was the only one with the service enabled!

leex12 · May 27, 2022

It's been running fine for over 24 hours, so this feels resolved to me! Thanks @aaron for holding my hand.
