Network stops working (was: OSDs on one node keep crashing)

Hi,

I'm having an annoying problem with a 5-node cluster. All OSDs on one node crash just after midnight on most nights.

In syslog I can see that, starting at 23:58:53, all the OSDs on the node start logging heartbeat_check failures with no reply from the other OSDs. A minute later all the OSDs are shut down. There is no problem starting them again in the morning, and they run fine until the next midnight OSD death, which doesn't happen every night.

Additionally, I have backup jobs running; I moved them to 23:00 instead of 00:00, but I still get the same midnight massacre. I have three networks (corosync, client and ceph) and I can ping the other nodes from all three interfaces (about 0.2 ms).

Any pointers on how to troubleshoot?
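For reference, this is roughly how I'm digging the relevant entries out of the logs (stock PVE/Ceph paths; the dates are just an example window):

Code:
# OSD heartbeat complaints straight from the Ceph logs
grep -h 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log

# The same window from the journal (placeholder dates)
journalctl -u 'ceph-osd@*' --since "2020-04-06 23:50" --until "2020-04-07 00:10"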
 
In syslog I can see that, starting at 23:58:53, all the OSDs on the node start logging heartbeat_check failures with no reply from the other OSDs. A minute later all the OSDs are shut down. There is no problem starting them again in the morning, and they run fine until the next midnight OSD death, which doesn't happen every night.

Additionally, I have backup jobs running; I moved them to 23:00 instead of 00:00, but I still get the same midnight massacre. I have three networks (corosync, client and ceph) and I can ping the other nodes from all three interfaces (about 0.2 ms).
I still argue that the backups may interfere. Are you sure they are finished before midnight? Can you also explain in a little more detail how your networks are separated?
 
I switched off the backups - same issue.

I thought it might be a lack of RAM (on the other machines in the cluster), so I've put more memory in all of them - still the same issue.

I'm pretty sure there's something wrong in the network configuration, but I've redone it to make sure it's correct - same issue.
Network configuration on each node:
1 NIC for pve/corosync
1 NIC bond and vmbr for public (used by VMs and ceph public)
1 NIC bond and vmbr for ceph cluster use (still waiting for delivery of the 10 GbE cards)
All networks are on separate switches. I'm taking out the bonding and keeping only the vmbrs for the public and ceph networks.


I think I need to reinstall that node, but I don't have that much spare time in the office, so for now I've added an hourly cron job to start the OSDs if they go down.
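The workaround cron is nothing fancy, roughly along these lines (the OSD IDs are placeholders for the ones on that node; systemctl start is a no-op if the unit is already running):

Code:
# /etc/cron.d/start-osds
0 * * * * root systemctl start ceph-osd@4 ceph-osd@5 ceph-osd@6 ceph-osd@7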
 
In syslog on the troubled node (pve006) I see that at 00:58:11 corosync reports that link 1 to all the other hosts is down (link 1 is the public NIC) and that the best link is 0 (the pve/corosync link).

I see the same in syslog on another node: at 00:58:11 link 1 to pve006 goes down, while link 0 stays up.

On pve006 the OSDs and the mon start to panic after 20 seconds without a heartbeat.
At 00:59:20 corosync reports that link 1 to the other hosts is up again.
At 01:04:36 corosync reports that link 1 is down again.

On the other node I see this for the corosync link 1 to pve006
01:04:36 link down
01:05:15 link up
01:05:44 link down
01:06:12 link up
01:39:17 link down
01:39:48 link up

Looking back, I see a similar pattern every night, starting at the same time (less than 15 seconds of variation in the start time).

I've now removed the backup job completely (not just disabled it).
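For anyone who wants to see the full timeline, the link events can be pulled straight from the journal; the pattern below matches the knet messages I'm seeing (adjust it if your corosync logs them slightly differently):

Code:
# corosync/knet link state changes over the last two days
journalctl -u corosync --since "2 days ago" | grep -E 'link: [0-9]+ is (down|up)'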
 
On the other node I see this for the corosync link 1 to pve006
01:04:36 link down
01:05:15 link up
01:05:44 link down
01:06:12 link up
01:39:17 link down
01:39:48 link up
You have a network issue. Can you post your interfaces file?
 
It's on a secure network, so I can't copy and paste.

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

auto eno4
iface eno4 inet static
    address xx.XX.xx.6/24
#pve network

iface eno5 inet manual

iface eno6 inet manual

auto bond1
iface bond1 inet manual
    bond-slaves none
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3

auto vmbr1
iface vmbr1 inet static
    address xx.YY.xx.6/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
#ceph network

auto vmbr0
iface vmbr0 inet static
    address xx.ZZ.xx.6/16
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
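For what it's worth, this is roughly how I check what actually came up after applying the config (interface and bond names as above):

Code:
# Brief state of all interfaces
ip -br link show

# Which ports are attached to which bridge
bridge link show

# Link/negotiation details for a single port (example name)
ethtool eno1

# Bond status, while the bond is still in use
cat /proc/net/bonding/bond1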
 
Which ceph public and cluster networks are set? And what is corosync running on? Also, are these settings identical on all nodes?
 
Code:
corosync
    ring 0    eno4 (pve network for corosync)
    ring 1    vmbr0 (public network for VMs and ceph public)

ceph
    cluster_network    vmbr1 (ceph internal traffic)
    public_network    vmbr0 (public network for VMs and ceph public)

In the corosync syslog entries, only ring 1 (vmbr0) goes down; ring 0 stays up.
All nodes have the same three networks (but may have different NIC names etc.).
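To make that concrete, the relevant bits look roughly like this (same masking as above, so treat the addresses as placeholders; the comments are mine):

Code:
# /etc/pve/corosync.conf (excerpt for this node)
node {
    name: pve006
    nodeid: 6
    quorum_votes: 1
    # ring 0: eno4, dedicated corosync network
    ring0_addr: xx.XX.xx.6
    # ring 1: vmbr0, shared public network
    ring1_addr: xx.ZZ.xx.6
}

# /etc/pve/ceph.conf (excerpt)
[global]
    # vmbr1, ceph internal traffic
    cluster_network = xx.YY.xx.0/24
    # vmbr0, public network shared with VM traffic
    public_network = xx.ZZ.0.0/16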
 
corosync: ring 0 eno4 (pve network for corosync), ring 1 vmbr0 (public network for VMs and ceph public)
Link 1 has interference from other traffic on the bridge. It is strongly recommended to have Corosync run on its own physical NIC ports.

ceph: cluster_network vmbr1 (ceph internal traffic), public_network vmbr0 (public network for VMs and ceph public)
Here could also be the issue, since corosync complains as well. The public network is used by all Ceph services and clients. VMs that do not access Ceph directly are not considered clients.
https://docs.ceph.com/docs/nautilus/rados/configuration/network-config-ref/
 
Link 1 has interference from other traffic on the bridge. It is strongly recommended to have Corosync run on its own physical NIC ports.

I have corosync ring 0 on its own physical NIC and physical switch. I'm mostly using the corosync syslog messages to detect the issue on the other NIC. However, would it be a problem to have ring 1 on a shared network - i.e. is it better not to have a second ring at all if it has to share a network?


Here could also be the issue, since corosync complains as well. The public network is used by all Ceph services and clients. VMs that do not access Ceph directly are not considered clients.
https://docs.ceph.com/docs/nautilus/rados/configuration/network-config-ref/

Sure, the ceph front-end traffic would see interference from other VM traffic. But until I have some stability I'm not running much on the pve systems, and I've now deleted the backup job, not just disabled it.

And it's still quite strange that it happens every night at about the same time and only for one host.

Curiouser and curiouser..
 
I have corosync ring 0 on its own physical NIC and physical switch. I'm mostly using the corosync syslog messages to detect the issue on the other NIC. However, would it be a problem to have ring 1 on a shared network - i.e. is it better not to have a second ring at all if it has to share a network?
Better a second ring than only one. But how stable will that ring be when disaster strikes?
 
So now I have received another NIC with 4 ports. It took some time to get it working properly, but now all ports come up with consistent device names.

I've bonded a port on the old NIC with a port on the new NIC for both my public network and the ceph cluster network.

My public network (vmbr over a bond over two ports on different NICs) still went down at 00:57 last night. The switch doesn't detect any link going down. I then have about two minutes of OSDs panicking because of failed heartbeat_checks.
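To get a timestamp of the drop that doesn't depend on corosync or the OSDs noticing it, I'm leaving something like this running overnight (a sketch with the standard journald/iproute2 tools):

Code:
# Kernel messages about link/bond/bridge changes, followed live
journalctl -kf | grep --line-buffered -iE 'link|bond1|vmbr0' >> /root/link-watch.log &

# Netlink link events as they happen, with timestamps
ip -ts monitor link >> /root/ip-monitor.log &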

At this point I can only conclude that this hardware model is haunted and I need an exorcist. I'm going to try a full reinstall, but I think I'll end up installing Hyper-V on the machine and running pve as a VM with the disks passed through directly. Windows Server is not experiencing any problems on an identical machine next to it. That same machine had the same issues running pve before and had to be reinstalled with Windows Server to finish an urgent project that we couldn't do with a flaky system running Debian.
 
I have now reinstalled Proxmox multiple times on the node (HP DL380 Gen10). I am using only the new NIC (HP with 4x e1000). My most recent installation was with 6.0.1, upgraded to the latest, running mon, mgr, mds and two OSDs. I've also moved all connections to another switch (Cisco -> Netgear). I've upgraded all HP firmware and drivers using the latest HP baseline.

At 00:57 (and at other times) each day, the network goes down. It comes up again after a little while (usually not fast enough to keep the OSDs from panicking and going down).
When I try to ssh (PuTTY) into the node from my desktop, the connection is refused. If I ssh into the node from another node and ping my desktop from there, I can then make a connection. Pinging the desktop takes a while to start working (the first six packets were lost).

Again, I have had no problems running Windows Server on these machines. I'm going to try installing Ubuntu 20.04 and see if I get similar issues.

Any other ideas? I'm grateful for any pointers, especially from Debian network experts.
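One more thing I'm going to try: a short packet capture around the bad minute, so there is at least something to look at the next time it happens (interface and file names are just examples):

Code:
# /etc/cron.d/flap-capture
PATH=/usr/sbin:/usr/bin:/sbin:/bin
# 10-minute capture on the public bridge, starting a few minutes before 00:57;
# timeout stops tcpdump again after 600 seconds.
52 0 * * * root timeout 600 tcpdump -ni vmbr0 -w /root/flap.pcap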
 
Another update in this mystery saga. I've installed pve in a Hyper-V VM on one of these HPE machines. The VM was installed with the new 6.2-1 iso, no updates. The pve installation is clean: not connected to the cluster, no Ceph installed, only one virtual NIC.

I started a job that prints the date and pings another machine once every 15 seconds, piped to a log file. I checked the log file the next day and sure enough - the ping failed at 00:57. Over about 24 hours of pinging every 15 seconds I also had another 37 missed packets - but it seems like a very strange coincidence that the network failed at 00:57 - again.
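The watcher itself is trivial, roughly this (the address is a placeholder for the machine I'm pinging):

Code:
#!/bin/bash
# Print a timestamp and one ping result every 15 seconds, appended to a log.
while true; do
    date '+%F %T'
    ping -c 1 -W 2 192.0.2.10 | tail -n 2
    sleep 15
done >> /root/pingwatch.log 2>&1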

I have no idea how to troubleshoot this..
 
I started a job that prints the date and pings another machine once every 15 seconds, piped to a log file. I checked the log file the next day and sure enough - the ping failed at 00:57. Over about 24 hours of pinging every 15 seconds I also had another 37 missed packets - but it seems like a very strange coincidence that the network failed at 00:57 - again.
This sounds like some external issue, not related to PVE or Hyper-V in that case. Either some BIOS setting or switch config that interferes at that time.
 
This sounds like some external issue, not related to PVE or Hyper-V in that case. Either some BIOS setting or switch config that interferes at that time.

Thanks - I am really scratching my head! I already changed the switch (but the machines are on the same network, of course) and the NIC. I haven't been able to find anything in the BIOS that seems relevant, and it's strange that it would affect both a pve installation running on bare metal and a VM under Hyper-V (with its own BIOS).

I will go through the BIOS again to see if I can find anything relevant. I'm also planning to install another OS (probably Ubuntu), as I see this in pve/Debian but not in Windows (as far as I can tell from the event logs).
 
Thanks - I am really scratching my head! I already changed the switch (but the machines are on the same network, of course) and the NIC. I haven't been able to find anything in the BIOS that seems relevant, and it's strange that it would affect both a pve installation running on bare metal and a VM under Hyper-V (with its own BIOS).
Are you sure that Hyper-V itself doesn't also exhibit the disconnect? Otherwise it could be some automation software, monitoring or anything else that has access from the outside. There might also be something in dmesg or the journal/syslog.
 
Hyper-V: nothing in the event log for Hyper-V-VmSwitch or Hyper-V-Compute.

Inside the VM:
dmesg -H - nothing from today
syslog - Proxmox VE replicator every minute, nothing else around 01:00
auth.log - sessions opened by cron for root every hour
pveam.log - failed update at 3:41 but nothing else
daemon.log - same as syslog

No other log files changed in that time period (it's a fresh install)
 
I started a job that prints the date and pings another machine once every 15 seconds, piped to a log file. I checked the log file the next day and sure enough - the ping failed at 00:57. Over about 24 hours of pinging every 15 seconds I also had another 37 missed packets - but it seems like a very strange coincidence that the network failed at 00:57 - again.
Which end was not available, the local Proxmox VE or the remote machine? Did you try with Proxmox VE 6.2?
 
Which end was not available, the local Proxmox VE or the remote machine? Did you try with Proxmox VE 6.2?

It was a very simple test: pinging another host every 20 seconds and checking when a ping failed (1 packets transmitted, 0 received, 100% packet loss, time 0ms).
Yes - this was a fresh install with the 6.2-1 iso (pveversion pve-manager/6.2-4/982457a (running kernel: 5.4.34-1-pve))