Lost link on Dell r720xd with Intel driver (Proxmox 2.2)

  • Thread starter: pstoneman

Hi all

We have a couple of Dell r720xd's operating as a Proxmox cluster with DRBD and so on. We've got the Intel 4x1GigE Dell daughter card in the servers, with two interfaces bonded for 'public' and two bonded (balance-rr) for DRBD synchronisation.

Last night I upgraded them (one at a time) to Proxmox 2.2. Everything came back stable, DRBD re-synchronised, all was good. A couple of hours later (after I'd gone home!), I saw this in the logs on both boxes: "kernel: igb: eth3 NIC Link is Down", followed 20 minutes later by "kernel: igb: eth2 NIC Link is Down". Obviously, DRBD then failed, and the whole world ended :-) When I logged on this morning, I saw eth2 on server1 in half-duplex mode, with no link on eth2 on server2. I saw server3 with eth3 in half-duplex, with eth3 on server1 with no link. I rebooted the servers, and everything came back as normal.
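
To line up the link drops on both boxes, something like the following could pull the igb link-state messages out of a log file. This is only a rough sketch: the function name `igb_link_events` is made up for illustration, and the default path `/var/log/syslog` is an assumption about where the kernel messages end up.

```shell
#!/bin/sh
# igb_link_events: hypothetical helper (not part of any package).
# Prints every igb "NIC Link is ..." line found in the given log file.
# The default log path is an assumption; pass another file as $1 if needed.
igb_link_events() {
    logfile="${1:-/var/log/syslog}"
    grep -E 'igb: eth[0-9]+ NIC Link is ' "$logfile"
}
```

Running `igb_link_events /var/log/syslog` on each server and comparing the timestamps side by side should show which end dropped first.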

There was nothing in any log file, in dmesg, or in the iDRAC/OMSA log to indicate why the network cards failed, so I'm a bit stumped as to why it happened. They're right next to each other, with eth2-eth2 and eth3-eth3 both via 1m brand-new network cables. They've been working and totally stable for a few months prior to last night. Since then (about 12 hours ago), everything's been stable, and there's been a normal amount of drbd traffic going over the backend interfaces, with no blips in dmesg/syslog.

Does anyone have any idea how I can start investigating it? I'm at a loss!

Thanks!

Phil
 
So this just happened again tonight. I had just applied the latest Proxmox updates and rebooted. Lo and behold, about 4 hours later, both crossover'd NICs lost link, then one came back properly, but one came back at half duplex (the other end had no link). Running "ifconfig eth2 down; ifconfig eth2 up" fixed it.
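
As a stopgap, the manual down/up fix could be wrapped in a small check. The sketch below assumes `ethtool` is installed and that its output contains the usual `Duplex:` and `Link detected:` lines. The helper names `needs_bounce` and `bounce_if_bad` are hypothetical. It only papers over the symptom and does not explain why the link dropped.

```shell
#!/bin/sh
# needs_bounce: hypothetical helper. Reads `ethtool <iface>` output on stdin
# and exits 0 if the link is reported as missing or as half duplex.
needs_bounce() {
    awk '
        /Duplex:/        { duplex = $2 }
        /Link detected:/ { link = $3 }
        END { exit !(duplex == "Half" || link == "no") }
    '
}

# bounce_if_bad: hypothetical wrapper that applies the manual fix
# (ifconfig down, then up) only when needs_bounce reports a problem.
bounce_if_bad() {
    iface="$1"
    if ethtool "$iface" | needs_bounce; then
        ifconfig "$iface" down
        ifconfig "$iface" up
    fi
}
```

For example, `bounce_if_bad eth2` would leave a healthy full-duplex link alone and only cycle the interface when it has come back at half duplex or with no link.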

In the meantime, DRBD had detected split-brain due to the loss of networking (although I've now recovered it).

Any idea why the NICs might have died - or what a good way to start debugging this would be?

Thanks...