Network errors and web interface hangs with PVE 1.7 kernel 2.6.35 on Dell servers

alain

Hi all,

I have a cluster of four Dell servers, and I am seeing network errors and web interface hangs on the two latest-generation ones (PE R710 and PE R610). These errors do not appear on the other two, older servers (PE 2950 and PE 2900).
The PE R710 is the master node, and when I try to access, for example, the hardware tab of a VM, it hangs.

This is what I just got in the web interface:
Code:
[3223]ERR:  24:  Error in Perl code: 500 read timeout
Here are the errors I see in dmesg:
Code:
 connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4349549670, last ping 4349550171, now 4349550672
 connection1:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4349549745, last ping 4349550246, now 4349550747
 connection2:0: detected conn error (1011)
There are also errors in syslog:
Code:
Feb  1 06:29:31 srv-kvm1 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4351052228, last ping 4351052729, now 4351053230
Feb  1 06:29:31 srv-kvm1 kernel: connection1:0: detected conn error (1011)
Feb  1 06:29:32 srv-kvm1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Feb  1 06:29:45 srv-kvm1 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4351053640, last ping 4351054141, now 4351054642
Feb  1 06:29:45 srv-kvm1 kernel: connection2:0: detected conn error (1011)
Feb  1 06:29:46 srv-kvm1 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
I should add that I have two iSCSI storages, connected on interface eth1 of each server.
It seems, in syslog, that the errors are related to iSCSI, but the web interface should not hang.

As I said, I see no problem with the other two nodes.

I wonder if it could be a problem with the Ethernet controllers built into the PE R710 and R610?
Code:
# lspci
....
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
In PE 2950, I have NetXtreme II BCM5708...

All servers are running PVE 1.7 with kernel 2.6.35-1 (KVM only):
Code:
srv-kvm1:/var/log# pveversion -v
pve-manager: 1.7-10 (pve-manager/1.7/5323)
running kernel: 2.6.35-1-pve
proxmox-ve-2.6.35: 1.7-9
pve-kernel-2.6.35-1-pve: 2.6.35-9
pve-kernel-2.6.24-8-pve: 2.6.24-16
qemu-server: 1.1-28
pve-firmware: 1.0-10
libpve-storage-perl: 1.0-16
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-10
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.13.0-3
ksm-control-daemon: 1.0-4
Can anyone give me a clue about this annoying problem?

Thanks.

Alain
 
I should add that I have no problem connecting to the faulty servers using ssh, and that I cannot upload an ISO image via the web interface, but can do it via SFTP (FileZilla). Weird...

Alain
 
One more piece of information: the bnx2 module is indeed loaded (is it the right driver for this controller?):
Code:
srv-kvm1:/var/log# lsmod
...
bnx2                   66313  0
 
the timeout occurs if there is any problem with the storage. (btw, this will be fixed for the 2.x release).

do you have a chance to test with our 2.6.32 kernel?
 
full details with:
Code:
modinfo bnx2
 
Code:
srv-kvm1:~# modinfo bnx2
filename:       /lib/modules/2.6.35-1-pve/kernel/drivers/net/bnx2.ko
firmware:       bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw
firmware:       bnx2/bnx2-rv2p-09-5.0.0.j10.fw
firmware:       bnx2/bnx2-mips-09-5.0.0.j15.fw
firmware:       bnx2/bnx2-rv2p-06-5.0.0.j3.fw
firmware:       bnx2/bnx2-mips-06-5.0.0.j6.fw
version:        2.0.15
license:        GPL
description:    Broadcom NetXtreme II BCM5706/5708/5709/5716 Driver
author:         Michael Chan <mchan@broadcom.com>
srcversion:     345D026A402B57D0DBC89E8
alias:          pci:v000014E4d0000163Csv*sd*bc*sc*i*
alias:          pci:v000014E4d0000163Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000163Asv*sd*bc*sc*i*
alias:          pci:v000014E4d00001639sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias:          pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends:
vermagic:       2.6.35-1-pve SMP mod_unload modversions
parm:           disable_msi:Disable Message Signaled Interrupt (MSI) (int)

I see nothing suspicious there...
 
Yes, I can give it a try, at least on the R610, a new server where there is currently no VM running. I'll let you know.

Thanks for the answers.

Alain
 
I tried the 2.6.32 kernel on the R610, and the problem does not seem to appear on this node with this kernel...

Code:
srv-kvm3:~# pveversion -v
pve-manager: 1.7-10 (pve-manager/1.7/5323)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.7-30
pve-kernel-2.6.32-4-pve: 2.6.32-30
qemu-server: 1.1-28
pve-firmware: 1.0-10
libpve-storage-perl: 1.0-16
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-10
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.13.0-3
ksm-control-daemon: 1.0-4

So perhaps a driver problem with 2.6.35-1 in PVE 1.7? I did not notice these problems before I upgraded from 1.6 to 1.7 two or three weeks ago...

Alain
 
Or a problem with my new node... I noticed that since I installed the 2.6.32 kernel on this node, I no longer see errors on the master node (R710), which stayed on the 2.6.35 kernel...

Strange...

Thanks for the hint!

Alain
 
After a night, still no errors, neither on the node (R610) where I installed the 2.6.32 kernel, nor on the other one (R710), still on the 2.6.35 kernel. So I reinstalled the 2.6.35 kernel on the R610, and after a reboot there are no more errors with this kernel either, at least so far...

So it appears to have been a transient problem. Curious...

Alain
 
I am coming back to this thread because today I tried to add a VM on my new node (R610), using the iSCSI storage. Until now, I had seen no more error messages...

As soon as I added the machine to the iSCSI storage (without even starting it), the network error messages reappeared. And, like the first time, the same messages appeared on my master node (R710, same family), but not on the other two (PE 2950 and PE 2900). I rebooted the R610 with the 2.6.32-4 kernel, but the same errors occurred.
I went into the BIOS (in fact UEFI): all four interfaces are TOE and iSOE (iSCSI Offload Engine), but I don't see anything wrong.

I migrated the new VM to another node, the PE 2950, still on iSCSI storage, and now there are no more error messages on the R610 or the R710, using 2.6.35.

So the problem does not depend on the kernel version, 2.6.35 or 2.6.32. It is really weird and annoying. It depends only on whether I access the iSCSI storage from the R610 node...

I wonder if I should try the 2.6.18 kernel...

Alain
 
Just a follow-up to close this thread. I just discovered a stupid error: I had put the same IP address, 192.168.100.2, on the iSCSI interface (eth1) of the two nodes where the errors were happening, the PE R610 and the R710. It was just a copy/paste error.

My only excuse is that I did not see any message in the logs saying that the IP was already in use. And the error messages only appeared when I tried to use the iSCSI interface on the second node...

Sorry for the disturbance.

All is working now, and it is not related to the device Broadcom NetXtreme II BCM5709, or the module bnx2.

My apologies again for not seeing this obvious mistake before.
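
For anyone landing here with the same symptoms: a duplicated address like this one can be caught by listing the iSCSI interface address of every node and looking for repeats. Below is a minimal sketch in shell, not something that was run on this cluster. The function name `find_dup_ips` and the input format (one `node ip` pair per line) are assumptions made for illustration, and the node names and every address other than 192.168.100.2 are illustrative.

```shell
# find_dup_ips (hypothetical helper): read "node ip" pairs on stdin and
# print every IP that is assigned to more than one node, with those nodes.
find_dup_ips() {
    awk '
        NF >= 2 {
            # Count how often each IP appears and remember which nodes use it
            count[$2]++
            nodes[$2] = (nodes[$2] == "" ? $1 : nodes[$2] "," $1)
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip, nodes[ip]
        }
    '
}

# Sample input: illustrative node names and addresses, except 192.168.100.2
printf '%s\n' \
    'srv-kvm1 192.168.100.2' \
    'srv-kvm2 192.168.100.3' \
    'srv-kvm3 192.168.100.2' \
    'srv-kvm4 192.168.100.5' | find_dup_ips
```

With the sample input this prints `192.168.100.2 srv-kvm1,srv-kvm3`. On a live node, the iputils version of `arping` also has a duplicate address detection mode (`arping -D -I eth1 192.168.100.2`), assuming that variant of the tool is the one installed.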
 
thanks for feedback!
 
