Hi everyone. I have a very weird problem with a new Proxmox node I've added to our cluster.
Yesterday I made some changes to our server network. I took out one of our Gluster servers (one node of three - the other two are picking up the workload), wiped it clean, and installed a fresh copy of Proxmox 4.2 on it, to match the other Proxmox servers we have in our cluster.
Then I came back to the office and followed the procedure for adding a new Proxmox node here: https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster#Adding_nodes_to_the_Cluster
About this time, I started getting pages from Munin reporting 1 "Network Error" per second on every one of our Proxmox VMs. We have about 20 of these. The strange part is that these network errors do not appear on any of our bare-metal servers, nor on the Proxmox hosts themselves. I do not see any kernel errors on any of the VMs, so this is more than a little difficult to diagnose. Typically, if this were happening on a bare-metal server, I would suspect the network cabling or the ethernet switch, but, well, a VM's network interface is not exactly a physical hardware item.
I tried shutting down the new Proxmox server last night to see if that changed anything, and Munin is now reporting a slightly higher rate of RX errors: around 1.8 per second.
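To cross-check Munin's per-second figure independently, something like the following could be run inside a VM. It is only a minimal sketch, assuming the standard Linux counters under /sys/class/net/<iface>/statistics/ are available in the guest; the function names, the "eth0" default and the 10-second interval are my own choices for illustration.

```python
import time
from pathlib import Path

# Counter files the Linux kernel exposes per interface under
# /sys/class/net/<iface>/statistics/
COUNTERS = ("rx_packets", "rx_errors", "rx_frame_errors", "rx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Return the current RX counters for iface as a dict of ints."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def rate_per_second(before, after, seconds):
    """Per-second change of each counter between two samples."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample_rates(iface="eth0", interval=10.0):
    """Take two samples interval seconds apart and return the rates."""
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return rate_per_second(before, after, interval)
```

Calling sample_rates("eth0") on a VM and on the Proxmox host at the same time should show whether the rx_errors and rx_frame_errors rates really differ between guest and host, without going through Munin.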
Below is an illustration of the network errors we're seeing from the Linux "ifconfig" command:
First, on the Proxmox server "Cloud4":
ernied@cloud4:~$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:1d:09:2b:c5:da
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2769032275 errors:0 dropped:0 overruns:0 frame:0
TX packets:3786890182 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2870084235027 (2.6 TiB) TX bytes:4276347463672 (3.8 TiB)
On one of the VMs running on that server:
ernied@mysql:~$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 36:64:32:30:61:39
inet addr:XX.XX.XXX.XXX Bcast:XX.XX.XXX.255 Mask:255.255.255.0
inet6 addr: fe80::3464:32ff:fe30:6139/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:303503289 errors:69226 dropped:0 overruns:0 frame:69226
TX packets:404917626 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:29967628990 (29.9 GB) TX bytes:582132621147 (582.1 GB)
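To put the VM's counters in proportion, here is a small sketch that parses the RX line from the output above. It assumes the older net-tools "name:value" layout shown here (other ifconfig versions print these counters differently), and the function names are hypothetical.

```python
import re

# RX counter line copied from the VM's ifconfig output above
RX_LINE = "RX packets:303503289 errors:69226 dropped:0 overruns:0 frame:69226"


def parse_counters(line):
    """Parse the 'name:value' pairs of an ifconfig counter line into a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


def summarize_rx(line):
    """Return the error ratio and whether every RX error is a frame error."""
    c = parse_counters(line)
    return {
        "error_ratio": c["errors"] / c["packets"],
        "all_errors_are_frame": c["errors"] == c["frame"],
    }


summary = summarize_rx(RX_LINE)
print(f"error ratio: {summary['error_ratio']:.6%}")
print(f"all RX errors are frame errors: {summary['all_errors_are_frame']}")
```

On these numbers every RX error is also counted as a frame error, and the errors amount to roughly 0.02% of received packets.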
And on one of the bare-metal servers, which has nothing to do with any of this but is connected to the same physical network:
ernied@backup:~$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 08:60:6e:86:7d:d6
inet addr:XX.XX.XXX.XXX Bcast:XX.XX.XXX.255 Mask:255.255.255.0
inet6 addr: fe80::a60:6eff:fe86:7dd6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4282334058 errors:0 dropped:17856 overruns:0 frame:0
TX packets:875337342 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2629409024 (2.4 GiB) TX bytes:1544229148 (1.4 GiB)
Interrupt:41 Base address:0x4000
Any help in diagnosing this problem is greatly appreciated.