Network drops on new VMs, not old

darthjarjar

New Member
Dec 7, 2015
I had several virtual machines running on an Ubuntu server with KVM. Same install I normally use but for some reason the network kept dropping on new KVM guests. Not the old guests, approx 10 of them, just the new ones. Went through every possible fix I could find and couldn't resolve the problem. So, I decided to upgrade to Proxmox since I had heard a lot of good things about it.

Everything works great, except I'm having the same problem. The first 10 virtual machines, a mix of CentOS 6.5, Ubuntu 14.04.3 server, and one Xubuntu desktop, work gloriously. Total CPU usage is almost always low and the network is rarely hit hard: small bursts and one big transfer at night. New KVM guests, and an LXC container I tried, drop after 10-15 minutes of inactivity. Just now I was trying to install CDR-Stats on the container and the network froze when the install script was about 75% done. Both the Proxmox host and the guest were running normally; the network was down only on the new guest. No other guests were experiencing any problems and none of them were using a lot of resources at the time; utilization was actually really low since the developers are all in a meeting.

So, either I have a hardware issue or I don't understand what is going on.

Host: Intel MB w/2 Xeon X5650, 48GB RAM, 6x250 SSD hardware RAID 5, dual onboard Intel Gigabit Server NICs, Netgear ProSafe managed switches. Only network change I can think of outside of the default installation was to add the second NIC to the bridge using the web gui.

Any ideas where I can start looking for problems?
 
Hi
Does the network work after a guest reboot?
Are you on Proxmox 4.0?
 
Yes, the network on the guest works perfectly after a reboot. Until it drops again. Proxmox 4.0, brand new install from the Proxmox ISO.

This has happened on multiple guests, and on different versions of Linux. Assuming the problem might be caused by my configuration on the guest, I downloaded the templates for the Ubuntu 14.04-1, Debian 8.0, and CentOS 7 LXC containers and tried all of them. Same problem. Network just drops. Everything else on the guest works fine. Other guests on the system work fine. Host works fine. It is like I hit a wall at 10-11 guests; after that, anything that gets added has this problem.

Since I don't see any obvious configuration issue that is causing this, and the problem has remained after switching operating systems (Ubuntu 14.04 w/KVM to Proxmox VE 4.0), I am going to assume the problem is probably hardware related at this point. I am going to install a couple of Intel CT gigabit adapters, disconnect the onboard NICs, and reconfigure the bridge during an outage window Saturday morning. If that doesn't resolve it, I am going to move them to a different switch.
 
Hi
Thank you for the feedback.
Also, when the network drops in the guest, do you see anything related in the host dmesg? Like the bridge changing its forwarding state (I assume you're bridging).
Which NICs were you using on the host?
 
Amusingly, the only thing in /var/log/dmesg is "(Nothing has been logged yet.)". I went through this on the previous system before trying Proxmox and couldn't find anything in the logs that would indicate the cause of the problem. Syslog shows "port 13(veth112i0) entered disabled state" (with similar entries for other guests) when the network drops, but that doesn't give me any indication of what is causing the problem. A quick grep of the log directory for "vmbr0", "eth0" and "eth1" doesn't really show anything unusual, other than that the grep results for "eth0" don't include any of those "...disabled state" messages; all of them are on "eth1". For reference, I'm including the interfaces file. Again, nothing was changed here: I did the easy install and tried to leave the setup as default as possible, just to see if the problem I was experiencing on Ubuntu was a misconfiguration problem.
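
For anyone chasing the same symptom, here is a small sketch for tallying those "entered disabled state" messages per bridge port, which makes it easier to see whether the drops cluster on particular guests. The function name count_disabled_ports and the /var/log/syslog path are assumptions for illustration; the pattern only matches the message format quoted above. The live kernel ring buffer can also be read by running the dmesg command instead of opening /var/log/dmesg.

```shell
# Count "entered disabled state" bridge events per port name.
# Reads syslog-style lines on stdin, prints "<count> <port>", busiest port first.
count_disabled_ports() {
    grep 'entered disabled state' \
        | sed -n 's/.*port [0-9]*(\([^)]*\)) entered disabled state.*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example (log path is an assumption for a Debian-based host):
#   count_disabled_ports < /var/log/syslog
```

Comparing the timestamps of the busiest ports against the times a guest lost connectivity would show whether the log entries are the drop itself or just a guest being stopped.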

/etc/network/interfaces

Code:
auto lo

iface lo inet loopback


iface eth0 inet manual


iface eth1 inet manual


auto vmbr0
iface vmbr0 inet static
        address  192.168.200.43
        netmask  255.255.255.0
        gateway  192.168.200.1
        bridge_ports eth0 eth1
        bridge_stp off
        bridge_fd 0
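
One thing that may be worth ruling out with a layout like this, offered as an assumption rather than a diagnosis: when two physical NICs sit in the same bridge with STP off and both are cabled to the same switch segment, the bridge can form a layer-2 loop or make MAC addresses flap between switch ports. The sketch below only flags such stanzas in an interfaces file; the function name multi_port_bridges is hypothetical.

```shell
# Print every stanza in an interfaces(5)-style file whose bridge lists
# more than one port. Reads the file on stdin and prints
# "<bridge>: <ports> (stp <value>)".
multi_port_bridges() {
    awk '
        function report() {
            if (name != "" && nports > 1)
                printf "%s: %s (stp %s)\n", name, ports, stp
        }
        $1 == "iface" { report(); name = $2; ports = ""; nports = 0; stp = "unset" }
        $1 == "bridge_ports" || $1 == "bridge-ports" {
            nports = NF - 1
            ports = $2
            for (i = 3; i <= NF; i++) ports = ports " " $i
        }
        $1 == "bridge_stp" || $1 == "bridge-stp" { stp = $2 }
        END { report() }
    '
}

# Example:
#   multi_port_bridges < /etc/network/interfaces
```

A hit is not proof of a problem, it only marks a configuration where pulling one cable, or moving the second NIC out of the bridge, is a cheap experiment.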
 
I'm facing this issue too. It looks like it occurs only on KVM guests; LXC guests are OK.
My setup is vmbr0 on eth1 connected to the switch trunk port. On this bridge I set up eth0 and eth1 for KVM guest on VLAN10 and VLAN20.

eth0 on VLAN10 has public IP
eth0 on VLAN20 has private IP

When such a drop happens I'm unable to ping the public IP, unable to access ssh, and unable to access any open port on the PUBLIC network (the guests involved are SMTP and IMAP servers).
But I can still log in via the private IP. I thought there would be a general traffic drop on the public network, but netstat shows traffic on the public network during such a drop, and tail -f mail.log shows postfix processing mail from the internet.
The drop sometimes lasts 10 seconds, sometimes 30 seconds. A guest reboot fixes it immediately.

Nothing special in logs.

I've tried several things, for example changing the NIC driver from virtio to E1000, disabling the firewall, tuning IPv4 via sysctl, disabling TX/RX offloading and blah blah blah... nothing helps. Again, LXC guests with the same network setup don't have this issue.
 
We bought a 3500€ server. I told them I would use Proxmox because of the great features, and now I'm screwed, because we are still using our old bare-metal server.
 
Oh my god, my boss would kill me :D But how do we solve these network issues... is there a workaround?

Sure. Dedicated hardware.

No drops between CEPH nodes on dedicated NICs, no drops between Proxmox management network on dedicated NICs, no drops on LXC hosts, no KVM drops on private VLAN network on bridge vmbr0 and finally huge KVM network drops on public VLAN network on bridge vmbr0.

Proxmox guys... I would post you every info/log/config. I am not the only one who is suffering with a strange network issue.
 
Debugging network issues is hard (or impossible) via forum questions and answers. But as soon as you identify a bug in the software stack, we try to fix it.

Try to find a test case where you can easily produce the issue.
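
As one possible starting point for such a test case, here is a rough sketch that probes a guest once per second and records only the moments its reachability changes, so drops can be lined up against host logs. The function names probe and state_changes and the example address are hypothetical; the ping options assume the Linux iputils ping.

```shell
# Turn a stream of "<epoch> <up|down>" samples into one line per state change.
state_changes() {
    awk '$2 != prev { print $1, (prev == "" ? "start" : prev), "->", $2; prev = $2 }'
}

# Probe a target once per second and emit "<epoch> up|down" samples.
# -c 1 sends a single echo request, -W 1 waits at most one second for the reply.
probe() {
    while :; do
        if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then s=up; else s=down; fi
        echo "$(date +%s) $s"
        sleep 1
    done
}

# Example (the address is a placeholder for the affected guest):
#   probe 192.0.2.10 | state_changes
```

Running it from the host and from a second machine on the same switch at the same time would help separate a problem inside the bridge from one on the physical network.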
 
I want to stay on Proxmox because I simply love it. Jesus, what guests are they? Windows Server machines? I'm wondering whether the error would occur if I used Windows Server 2008 instead of 2012. Another option would be to run our server on a Linux machine until the Windows Server problem is fixed. The fastest solution would be to run Windows Server 2012 on bare metal, but then I would lose all the great features of Proxmox and would need a new home for our Linux guests.

Tom, what should I do to identify the network problems? tcpdump wasn't very helpful, though I'm new to analyzing network data.
 
Looks like the network drops stopped two weeks ago. I'm not sure which update caused this, but HELL, I don't care.
I made no config or hardware changes, only apt-get upgrade.
 
I thought I had updated this, but it looks like I did not.

After going round and round with this issue I installed two new Intel CT NIC cards and stopped using the onboard. Outside of one guest that dropped once last week, which could be an issue with the guest of course, I have not had a repeat of the problem. I am under the assumption that something was wrong with the onboard adapters or the kernel simply didn't like them.
 
Hi Jesus

This is interesting. What was the model of the onboard network card, and which driver was associated with it?
You can see this with:
lspci -k
Look for the ethernet string in the output

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V
Subsystem: ASUSTeK Computer Inc. Device 85c4
Kernel driver in use: e1000e

Was it a Realtek NIC?
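
To cut the lspci -k output down to just the network cards, something like the following sketch could be used. The function name nic_drivers is made up for illustration, and the matching relies on the "Ethernet controller:" and "Kernel driver in use:" strings shown in the sample output above.

```shell
# Reduce `lspci -k` output to one "<device> -> <driver>" line
# per Ethernet controller. Reads lspci output on stdin.
nic_drivers() {
    awk '
        # A line starting with a bus address opens a new device record.
        /^[0-9a-f]/ {
            dev = ""
            if ($0 ~ /Ethernet controller: /) {
                dev = $0
                sub(/^.*Ethernet controller: /, "", dev)
            }
        }
        # Report the driver only while inside an Ethernet device record.
        /Kernel driver in use:/ && dev != "" { print dev " -> " $NF; dev = "" }
    '
}

# Example:
#   lspci -k | nic_drivers
```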
 
No Realtek, only Intel

04:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
Subsystem: Super Micro Computer Inc Device 1528
Kernel driver in use: ixgbe
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
Subsystem: Super Micro Computer Inc Device 1528
Kernel driver in use: ixgbe
81:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
Subsystem: Super Micro Computer Inc Device 0656
Kernel driver in use: igb
81:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
Subsystem: Super Micro Computer Inc Device 0656
Kernel driver in use: igb

I've had drops on I350.
 
Hmm, OK, then it might point to a hardware problem or maybe an 'igb' driver problem.
Without anything in the log it's difficult to pinpoint.
Which kernel are you running? What is the output of modinfo 'igb'?
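
Since the full modinfo output is mostly a long alias list, a short sketch like this keeps only the lines that identify the driver build. The function name driver_summary is hypothetical; the field names are the ones modinfo prints.

```shell
# Keep only the identifying fields of `modinfo` output,
# dropping the alias and parm lines. Reads modinfo output on stdin.
driver_summary() {
    grep -E '^(filename|version|srcversion|vermagic):'
}

# Example: running kernel release, then the igb driver details.
#   uname -r
#   modinfo igb | driver_summary
```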
 
This is the current kernel. I can't give you output of the previous one...

root@sm01:~# modinfo igb
filename: /lib/modules/4.2.6-1-pve/kernel/drivers/net/ethernet/intel/igb/igb.ko
version: 5.2.18-k
license: GPL
description: Intel(R) Gigabit Ethernet Network Driver
author: Intel Corporation, <e1000-devel@lists.sourceforge.net>
srcversion: 6536D0B3C85B1DF8FC4A9DA
alias: pci:v00008086d000010D6sv*sd*bc*sc*i*
alias: pci:v00008086d000010A9sv*sd*bc*sc*i*
alias: pci:v00008086d000010A7sv*sd*bc*sc*i*
alias: pci:v00008086d000010E8sv*sd*bc*sc*i*
alias: pci:v00008086d00001526sv*sd*bc*sc*i*
alias: pci:v00008086d0000150Dsv*sd*bc*sc*i*
alias: pci:v00008086d000010E7sv*sd*bc*sc*i*
alias: pci:v00008086d000010E6sv*sd*bc*sc*i*
alias: pci:v00008086d00001518sv*sd*bc*sc*i*
alias: pci:v00008086d0000150Asv*sd*bc*sc*i*
alias: pci:v00008086d000010C9sv*sd*bc*sc*i*
alias: pci:v00008086d00000440sv*sd*bc*sc*i*
alias: pci:v00008086d0000043Csv*sd*bc*sc*i*
alias: pci:v00008086d0000043Asv*sd*bc*sc*i*
alias: pci:v00008086d00000438sv*sd*bc*sc*i*
alias: pci:v00008086d00001516sv*sd*bc*sc*i*
alias: pci:v00008086d00001511sv*sd*bc*sc*i*
alias: pci:v00008086d00001510sv*sd*bc*sc*i*
alias: pci:v00008086d00001527sv*sd*bc*sc*i*
alias: pci:v00008086d0000150Fsv*sd*bc*sc*i*
alias: pci:v00008086d0000150Esv*sd*bc*sc*i*
alias: pci:v00008086d00001524sv*sd*bc*sc*i*
alias: pci:v00008086d00001523sv*sd*bc*sc*i*
alias: pci:v00008086d00001522sv*sd*bc*sc*i*
alias: pci:v00008086d00001521sv*sd*bc*sc*i*
alias: pci:v00008086d0000157Csv*sd*bc*sc*i*
alias: pci:v00008086d0000157Bsv*sd*bc*sc*i*
alias: pci:v00008086d00001538sv*sd*bc*sc*i*
alias: pci:v00008086d00001537sv*sd*bc*sc*i*
alias: pci:v00008086d00001536sv*sd*bc*sc*i*
alias: pci:v00008086d00001533sv*sd*bc*sc*i*
alias: pci:v00008086d00001539sv*sd*bc*sc*i*
alias: pci:v00008086d00001F45sv*sd*bc*sc*i*
alias: pci:v00008086d00001F41sv*sd*bc*sc*i*
alias: pci:v00008086d00001F40sv*sd*bc*sc*i*
depends: ptp,dca,i2c-algo-bit
intree: Y
vermagic: 4.2.6-1-pve SMP mod_unload modversions
parm: max_vfs:Maximum number of virtual functions to allocate per physical function (uint)
parm: debug:Debug level (0=none,...,16=all) (int)