Proxmox network drivers are all quite unstable

felipe

Well-Known Member
Oct 28, 2013
We switched from a plain libvirt/KVM setup to Proxmox.
Before that we had a solid running setup and never any problems.
Now we run into one problem after another. It started with issues with the Windows IDE drivers,
and now all our Windows machines with rtl8139 have lost their network.
We changed to e1000, but deactivating IPv6 freezes the machine, and some Windows clients now suddenly hang without doing anything.
It's catastrophic to run a production cluster like this.
 
Strange, I have never had a problem in years with Windows and network drivers. Try a different NIC model and read the wiki notes about Windows guests.

Marco
 
Windows guests have always been a nightmare for me too. At the beginning I had tons of "strange" problems with the virtio drivers, then I switched to Intel e1000. I had some VMs that worked fine with rtl8139 and then got network freezes on a more recent Proxmox. Different Windows versions behave differently; in short, a minefield.
But it can't be a Proxmox issue, since the underlying technology is KVM and the Fedora virtio guest drivers, i.e. the same technology you used with libvirt and KVM.
If you use the same KVM version with libvirt and the same guest virtio drivers, you should experience the same problems there as well. If that is not the case, please report your exact config/versions, because if there is a bug on the Proxmox side I'm sure the developers will be happy to (try to) fix it.
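In case it helps, the usual way to collect that information on the Proxmox host is something along these lines (the VMID is a placeholder for the affected guest):

pveversion -v
qm config <VMID>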
 
We have about 50 different Windows machines. In the last 3 years we never had any problems with rtl8139 or any other NIC.
Now we moved to Proxmox and the problems started. I really like Proxmox, and the Proxmox-specific parts work perfectly,
but the kernel versions etc. don't seem as stable with these drivers as my old setup was. I would never have expected this, because so many people use Proxmox and I thought they would pick only the most stable components.
But we ran 3 different kernel versions before, even different Ubuntu versions, and never had any problems with rtl8139 in Windows (Linux guests are still stable).
The current kernel used by Proxmox seems to have big problems with Windows guests (network).

My last attempt to use a newer virtio driver resulted in a bluescreen, and the machine would not even boot anymore; I had to remove the driver by hand.
I have real trouble arguing with our Windows guys, because a simple Hyper-V setup runs a lot more stably than all the KVM stuff.
The sad thing is that before changing to Proxmox we never had any issues.
If any developer wants to know the exact details of what is not working:

pveversion -v
proxmox-ve-2.6.32: 3.1-114 (running kernel: 2.6.32-26-pve)
pve-manager: 3.1-24 (running version: 3.1-24/060bd5a6)
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-2
pve-cluster: 3.0-8
qemu-server: 3.1-8
pve-firmware: 1.0-23
libpve-common-perl: 3.0-9
libpve-access-control: 3.0-8
libpve-storage-perl: 3.0-18
pve-libspice-server1: 0.12.4-2
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.1-1

I would like to contribute to making this more stable!
 
I would love to have details of your previous working setup: kernel version, KVM version, etc. In my case, when Proxmox had an older KVM, rtl8139 in Windows worked fine; when I upgraded to a Proxmox with a more recent KVM version, I had (have) connectivity problems, i.e. if you test with iperf, after some seconds the connection stops on the guest side.
What puzzles me is that Proxmox uses standard and stable components, like the Red Hat kernel and stable KVM code, so I doubt that what they add on top can be the cause of such trouble. But if you provide more details or are able to do more tests (e.g. in your old working setup: install a newer KVM, the standard RH kernel, the kernel from Proxmox, or change the KVM flags used to run the VM, etc.), maybe the developers could see whether it is something they added.
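For anyone wanting to reproduce that kind of stall, a sustained iperf run makes the moment throughput dies easy to spot (iperf 2 syntax; the guest IP is a placeholder):

# on the guest
iperf -s

# on another machine on the same LAN, run for 5 minutes, reporting every 5 seconds
iperf -c <guest-ip> -t 300 -i 5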
 
I had:
3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
and one Ubuntu 2.6.32.xxxx kernel.
All three worked perfectly with rtl8139.
Since I replaced the server I can't check the exact KVM versions anymore, but I can set up a new server to check whether the network drivers work there.

I was also expecting the Proxmox kernel/KVM to be more stable than other Ubuntu/Debian versions because of all the testing,
but it seems that with the current Proxmox version the rtl NIC does not work at all with Win7 guests and is very slow with Win2008 guests. On Linux I have not lost network connectivity so far, but the servers configured with the rtl NIC are not heavily used; still, I think Linux is a lot more stable than Windows here.
Generally I would say that on the current Proxmox it is risky to use the rtl card.
e1000 seems to work fine at the moment; stress tests over several hours did not cause any problems.

Is there a page in the wiki describing which Proxmox version and which network or disk drivers run well with which Linux and Windows versions? That would help people out there a lot, especially since some versions of the virtio drivers are very buggy while others work well.
That is what I really miss in the Proxmox wiki: knowing which Windows/Linux version works perfectly and stably with which Proxmox version and drivers! (I think the big problem with KVM is generally Windows.)
 


My last attempt to use a newer virtio driver resulted in a bluescreen, and the machine would not even boot anymore; I had to remove the driver by hand.

I have had this happen once: I let Windows search for the driver, and it used one from the wrong sub-folder of the virtio driver ISO.
There is a handy table here to help one choose the proper folder:
http://pve.proxmox.com/wiki/Paravirtualized_Network_Drivers_for_Windows#Introduction

I use virtio for disks and e1000 for network when using Windows.
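For reference, switching a guest's NIC model to e1000 is a one-liner; a rough sketch with a placeholder VMID and the default vmbr0 bridge (if no MAC address is given, a new one is generated, so reuse the old MAC if DHCP leases depend on it):

qm set <VMID> -net0 e1000,bridge=vmbr0

# the resulting line in /etc/pve/qemu-server/<VMID>.conf looks something like:
# net0: e1000=AA:BB:CC:DD:EE:FF,bridge=vmbr0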
 
are the "latest" drivers stable? i know it from a lot of people and it also happened to me that virtio drivers where really unstable...
on some machines we allready use older virtio drivers....
i changed 2 weeks ago a ide driver to virtio which resulted in blue screen- reading about lots of complains in the internet since then i am little bit afraid of the virtio drivers. e1000 allways worked and had enogh performance for most guests...
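For what it's worth, the bluescreen when switching the system disk from IDE to virtio is usually avoided by letting Windows install the virtio storage driver before the boot disk is moved. A rough sketch, with a placeholder VMID and assuming a storage named "local":

# 1) attach a small temporary virtio disk so Windows detects new hardware
qm set <VMID> -virtio1 local:1
# 2) boot the guest, install the virtio storage driver from the driver ISO, shut down
# 3) change the boot disk in /etc/pve/qemu-server/<VMID>.conf from "ide0: ..." to
#    "virtio0: ..." and set "bootdisk: virtio0", then remove the temporary disk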
I just got rid of all the rtl8139 cards, but about 20-30 machines still had that driver. Not fun work on production machines.
I was just really disappointed that the Proxmox kernel is not as stable with the NICs as my standard Ubuntu KVM setup was before.
I love the Proxmox interface and web GUI, but the kernel... that is far from what I expected, even considering that it is an old kernel.
As mentioned in another post today, it is not even possible to install Win2012 Remote Desktop Services because of the old kernel.
 
I am running the exact same version as you and am encountering similar issues. In my case my whole network has become destabilized by adding the Proxmox server. My previously stable router now crashes 3 times a week, and the Proxmox server has random network slowdowns. I did the following to make Proxmox somewhat more stable:
ethtool -K eth0 rx off

The server was up for 1 week before a crash, vs. 1 day before. There is nothing in the logs after a crash. I've configured 12 OpenVZ containers, I like Proxmox, and I am now trying to figure out what is going on. I'm not sure if this is a networking issue or if I just have bad memory, a bad network cable or other bad hardware. I will post a solution if I find one.

My hardware is a 3-5 year old SGI Rackable with 24 GB of RAM and 2 quad-core AMD CPUs.
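In case it helps anyone following along: that ethtool setting (it disables receive checksum offloading) does not survive a reboot. One common way to make it persistent, assuming eth0 is the NIC bridged into vmbr0 and using placeholder addresses, is a post-up line in /etc/network/interfaces:

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        post-up ethtool -K eth0 rx off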
 
Hi, about Windows and rtl8139: that is a confirmed bug, fixed in qemu git:
http://git.qemu.org/?p=qemu.git;a=commit;h=00b7ade807b5ce6779ddd86ce29c5521ec5c529a

(So it should be fixed in the next Proxmox release with qemu 1.7.)

As for e1000, I really don't know, I have never had a problem with it.
 
It turns out my issue was due to bad RAM :( No more networking issues once the bad RAM was removed.
 
Has anyone resolved this issue? I have the exact same problem, but only with Windows; CentOS and Ubuntu work great.
It seems like the network drivers just go to sleep. If I connect to the console and ping one of my other machines, the network card starts working and I can also ping this VM.

If I let it sit idle for 30 seconds and then try to ping the VM again, it "Request timed out" on me. I can then go back into the console, ping again, and the network is alive again.

It is very strange: if I log into the console and set up a persistent ping, it keeps the network running indefinitely; as soon as I stop the ping and wait a few seconds, the VM NIC stops responding.
 
As mentioned, rtl8139 has a bug that should be fixed with the update to Proxmox 3.2 (qemu 1.7), but I don't use rtl8139 cards anymore. Just change it to e1000; it works perfectly.
 
Thanks for the quick response. I played around with it a little more and it seems to be related to bonding. I set up both NICs in a bond using balance-rr (my switch does not support 802.3ad), and this does not seem to work very well. I changed it to balance-alb and that seems to work just fine.
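For readers who hit the same thing, a balance-alb bond under the Proxmox bridge looks roughly like this in /etc/network/interfaces (NIC names and addresses are placeholders):

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode balance-alb

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

balance-rr sends frames from the same source MAC out of every slave in turn, which reorders packets and makes the switch see that MAC flapping between ports, so symptoms like those above are common with it; balance-alb (or 802.3ad where the switch supports it) avoids that.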
 
I see you changed your bonding and resolved the problem; I just wanted to add a comment that might help future readers.

What you described is a classic symptom of a problem at layer 2 of the 7-layer burrito. http://en.m.wikipedia.org/wiki/OSI_model

A duplicated MAC or IP address, improperly configured VLANs, bridges, trunks, bonds, switches, etc. Switches and servers keep track of where a particular MAC is located and which MAC is used for a particular IP address. Due to a misconfiguration the MAC cannot be resolved properly, so the network stops working for the things using that MAC. When the problem MAC is actively used, e.g. by sending pings from it, the systems are 'reminded' where to find that MAC and the network starts working again.
Most switches I have used only cache MAC data for about 30 seconds, which is why the unreachable problem shows up after about 30 seconds of inactivity.
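On the Proxmox host itself you can watch this happening on the Linux bridge; a quick check, assuming the bridge is the default vmbr0:

# list the MACs the bridge has learned, with per-entry ageing timers
brctl showmacs vmbr0

# dump the bridge parameters, including the MAC ageing time
brctl showstp vmbr0 | grep -i ageing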

I know this and have seen it happen numerous times, and occasionally I still find myself scratching my head, thinking WTF, when these problems occur. It is too easy to overlook the basics when you are distracted by the cries of ten thousand users complaining they can't access the latest lolcats for five minutes. :-D