Thank you for your reply. The error recurred at roughly 08:00 EST (US).
Around that time, the syslog for the server I was looking at shows more than 5500 log entries.
4900 of them are variations on "Jan 31 08:02:18 ceph-02 ceph-osd[2430]: 2024-01-31T08:02:18.796-0500 7f3a4ba29700 -1 osd.8 30673 heartbeat_check: no reply from 172.29.6.21:6828 osd.16 since back 2024-01-31T08:02:18.643637-0500 front 2024-01-31T08:01:47.109720-0500 (oldest deadline 2024-01-31T08:02:12.073673-0500)"
Another 180 are like "Jan 31 08:02:18 ceph-02 ceph-osd[2406]: 2024-01-31T08:02:18.844-0500 7fb8e6f52700 -1 osd.12 30673 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.258494491.0:1269682 3.a1 3:853015dc:::rbd_data.760c617b626a33.0000000000000809:head [write 1159168~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e30668)"
There are also several instances of messages like:
Jan 31 08:02:40 ceph-02 corosync[1587]: [KNET ] link: host: 6 link: 0 is down
Jan 31 08:02:40 ceph-02 corosync[1587]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 31 08:02:40 ceph-02 corosync[1587]: [KNET ] host: host: 6 has no active links
All of these suggest to me that there's a network error.
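In case it helps anyone reproduce the counts above, here is a rough Python sketch of how the syslog lines could be tallied by message type and by the peer OSD named in each heartbeat failure. The function name `tally` and the category labels are my own, purely for illustration, and the patterns are based only on the sample lines quoted above, so they may miss other variations.

```python
import re
from collections import Counter

# Pulls the peer address and OSD id out of a heartbeat_check line.
# Pattern is an assumption based on the sample line quoted in this post.
HEARTBEAT_RE = re.compile(r"heartbeat_check: no reply from (\S+) (osd\.\d+)")


def tally(lines):
    """Count message categories and the peer OSDs named in heartbeat failures."""
    kinds = Counter()
    peers = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            kinds["heartbeat_check"] += 1
            peers[match.group(2)] += 1  # e.g. "osd.16"
        elif "slow ops" in line:
            kinds["slow_ops"] += 1
        elif "[KNET" in line:
            kinds["corosync_knet"] += 1
        else:
            kinds["other"] += 1
    return kinds, peers
```

To use it, open the syslog file, pass the file object to `tally`, and print the two counters. If most heartbeat failures name OSDs on a single host, that would point at that host's links rather than at the whole network.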
My /etc/network/interfaces file looks like this:
[code]
auto lo
iface lo inet loopback

auto ens4f0np0
iface ens4f0np0 inet manual

auto ens4f1np1
iface ens4f1np1 inet manual

iface enxb03af2b6059f inet manual

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens4f0np0 ens4f1np1
        bond-miimon 100
        bond-mode balance-alb

auto vmbr0
iface vmbr0 inet static
        address 172.29.6.22/17
        gateway 172.29.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
[/code]
ens4f0np0 and ens4f1np1 are both 10 Gbit/s fiber links.
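As a sanity check on the addressing, the small sketch below uses Python's standard `ipaddress` module to confirm that the unresponsive heartbeat peer from the log (172.29.6.21) and the gateway both fall inside the /17 configured on vmbr0. If so, the OSD heartbeat traffic should stay on the local segment and not pass through the gateway. The helper name `same_subnet` is just something I made up for this example.

```python
import ipaddress


def same_subnet(interface_cidr, other_ip):
    """Return True if other_ip lies inside the network of interface_cidr."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(other_ip) in network


# Values taken from the interfaces file and the heartbeat_check log line.
local = "172.29.6.22/17"
print(ipaddress.ip_interface(local).network)  # 172.29.0.0/17
print(same_subnet(local, "172.29.6.21"))      # heartbeat peer
print(same_subnet(local, "172.29.0.1"))       # gateway
```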
/proc/net/bonding/bond0 looks like:
[code]
root@ceph-02:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.15.131-2-pve
Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: ens4f1np1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: ens4f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 4
Permanent HW addr: e4:3d:1a:d6:c9:80
Slave queue ID: 0
Slave Interface: ens4f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 4
Permanent HW addr: e4:3d:1a:d6:c9:81
Slave queue ID: 0
[/code]
Neither of the two adapters has entries in syslog near the time these events occurred. The last events that the adapters logged were on 30 Jan, when the fiber switches had a firmware update applied.
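To see whether the "Link Failure Count" values move the next time the error occurs, something like the following Python sketch could be run before and after an incident and the results compared. It only parses the text format shown in the output above; the function name `link_failures` is my own and not part of any tool.

```python
def link_failures(text):
    """Return {slave_name: link_failure_count} from bonding status text."""
    counts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            # Remember which slave the following lines belong to.
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Link Failure Count:") and current is not None:
            counts[current] = int(line.split(":", 1)[1])
    return counts


def read_bond(path="/proc/net/bonding/bond0"):
    """Read the bonding status file and return the per-slave failure counts."""
    with open(path) as handle:
        return link_failures(handle.read())
```

With the output pasted above, `link_failures` gives a count of 4 for each of ens4f0np0 and ens4f1np1. If those numbers are unchanged after the next 08:00 event, the bond itself did not register a link drop, which would fit with the adapters logging nothing since 30 Jan.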