Hello Forum,
Recently one network port in our meshed Ceph cluster network failed, presumably due to
a kernel problem (please see the syslog below), which eventually led to unresponsive OSDs.
The cluster network is implemented as a dedicated 10 Gbit/s bond in broadcast mode (bond0 with slaves ens2 and ens3 on an Intel dual-port NIC).
After rebooting the nodes and updating to the latest Proxmox version,
the cluster and Ceph are healthy again.
To better understand the incident and the current situation, I have some questions:
1. Any idea what exactly caused the NIC to be removed (please check 'ixgbe 0000:0e:00.0: Adapter removed' in the logs below)?
2. Is there an established method to monitor the health of a bonded cluster network?
3. Can somebody please elaborate on the current dmesg entries below?
3.1: Why does bond0 enslave NICs (ens2, ens3) whose link is down (please see dmesg below)?
3.2: What does the message 'IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready' mean?
3.3: Why does this message appear at all, given that we use IPv4 only?
3.4: Why do the bond0 slaves toggle between 'link status definitely up' and 'link status definitely down' several times?
3.5: Is there anything I can do to further clean this up?
4. Since the outage occurred at 02:30 in the morning, the syslog file was flooded with
ceph-osd heartbeat_check messages and grew to more than a gigabyte.
Which component emits these messages (the log says ceph-osd), and can it be configured to suppress
the excessive repetition of the same message?
[pve-cluster-configuration]:
Proxmox-hyper-converged-ceph-cluster (3 nodes)
dedicated
# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-2
...
pve-kernel-5.15.85-1-pve: 5.15.85-1
...
ceph: 16.2.11-pve1
ceph-fuse: 16.2.11-pve1
corosync: 3.1.7-pve1
...
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
...
pve-cluster: 7.3-2
...
# ceph --version
ceph version 16.2.11 (578f8e68e41b0a98523d0045ef6db90ce6f2e5ab) pacific (stable)
[syslog]:
Mar 10 02:30:53 amcvh11 kernel: [12120460.405620] NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
Mar 10 02:30:53 amcvh11 kernel: [12120460.405629] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P IO 5.15.60-2-pve #1
Mar 10 02:30:53 amcvh11 kernel: [12120460.405633] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
Mar 10 02:30:53 amcvh11 kernel: [12120460.405634] RIP: 0010:mwait_idle_with_hints.constprop.0+0x48/0x90
...
Mar 10 02:30:53 amcvh11 kernel: [12120460.420436] ixgbe 0000:0e:00.0: Adapter removed
Mar 10 02:30:55 amcvh11 kernel: [12120462.405919] bond0: (slave ens3): link status definitely down, disabling slave
...
Mar 10 02:31:14 amcvh11 ceph-osd[2898]: 2023-03-10T02:31:14.243+0100 7f2dad793700 -1 osd.12 127024 heartbeat_check: no reply from 192.168.227.13:6848 osd.0 since back 2023-03-10T02:30:51.993505+0100 front 2023-03-10T02:31:12.996404+0100 (oldest deadline 2023-03-10T02:31:13.693102+0100)
Mar 10 02:31:14 amcvh11 ceph-osd[2906]: 2023-03-10T02:31:14.451+0100 7f58319a6700 -1 osd.6 127024 heartbeat_check: no reply from 192.168.227.13:6860 osd.17 since back 2023-03-10T02:30:49.063724+0100 front 2023-03-10T02:31:13.566681+0100 (oldest deadline 2023-03-10T02:31:14.363492+0100)
Mar 10 02:31:14 amcvh11 ceph-osd[2903]: 2023-03-10T02:31:14.631+0100 7fb067f0d700 -1 osd.45 127024 heartbeat_check: no reply from 192.168.227.13:6848 osd.0 since back 2023-03-10T02:30:48.970796+0100 front 2023-03-10T02:31:10.076248+0100 (oldest deadline 2023-03-10T02:31:14.270084+0100)
...
[dmesg-messages]
# cat dmesg | grep bond0
[ 18.025523] bond0: (slave ens2): Enslaving as an active interface with a down link
[ 18.477517] bond0: (slave ens3): Enslaving as an active interface with a down link
[ 22.972934] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 22.972948] bond0: active interface up!
[ 22.972964] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 23.492875] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 904.110614] bond0: (slave ens2): link status definitely down, disabling slave
[ 909.542564] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 1153.043135] bond0: (slave ens2): link status definitely down, disabling slave
[ 1158.467068] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 1170.982978] bond0: (slave ens2): link status definitely down, disabling slave
[ 1176.206818] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 1649.768203] bond0: (slave ens3): link status definitely down, disabling slave
[ 1655.308230] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 1925.932392] bond0: (slave ens3): link status definitely down, disabling slave
[ 1931.256357] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 1941.168199] bond0: (slave ens3): link status definitely down, disabling slave
[ 1946.396166] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 2541.891895] bond0: (slave ens3): link status definitely down, disabling slave
[ 2547.219800] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 2817.652117] bond0: (slave ens3): link status definitely down, disabling slave
[ 2823.184052] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
[ 2832.683882] bond0: (slave ens3): link status definitely down, disabling slave
[ 2838.007855] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
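To put numbers on the toggling asked about in 3.4, the down/up pairs in the dmesg output above can be summarized with a small script. This is only a rough sketch in Python: the function name summarize_flaps, the SAMPLE excerpt, and the regular expression are illustrative assumptions based on the line format shown above, not part of any Proxmox, Ceph, or kernel tooling.

```python
import re
from collections import defaultdict

# Matches bonding driver lines in the format seen in the dmesg output, e.g.:
# [ 904.110614] bond0: (slave ens2): link status definitely down, disabling slave
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bond0: \(slave (?P<nic>\w+)\): "
    r"link status definitely (?P<state>up|down)"
)


def summarize_flaps(dmesg_text):
    """Return {nic: [(down_ts, up_ts, outage_seconds), ...]} for each down/up pair."""
    down_since = {}            # nic -> timestamp of the last unmatched "down"
    flaps = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        ts, nic, state = float(m["ts"]), m["nic"], m["state"]
        if state == "down":
            down_since[nic] = ts
        elif nic in down_since:
            # An "up" that follows a recorded "down" closes one flap.
            start = down_since.pop(nic)
            flaps[nic].append((start, ts, round(ts - start, 3)))
    return dict(flaps)


# Excerpt copied from the dmesg output above (hypothetical variable name).
SAMPLE = """\
[ 22.972934] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 904.110614] bond0: (slave ens2): link status definitely down, disabling slave
[ 909.542564] bond0: (slave ens2): link status definitely up, 10000 Mbps full duplex
[ 1649.768203] bond0: (slave ens3): link status definitely down, disabling slave
[ 1655.308230] bond0: (slave ens3): link status definitely up, 10000 Mbps full duplex
"""

if __name__ == "__main__":
    for nic, events in summarize_flaps(SAMPLE).items():
        for down, up, secs in events:
            print(f"{nic}: down at {down:.3f}s, up at {up:.3f}s, outage {secs:.3f}s")
```

The initial "up" at boot is ignored because no "down" precedes it, so only real flaps are counted. Feeding the full dmesg text instead of SAMPLE should list every flap per slave.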
Any help is highly appreciated!