We've been using the same Open vSwitch network configuration, unchanged, for about two years. The issue is intermittent: sometimes on reboot the bridge does not come up with the requested MTU of 8996 and instead uses the default of 1500. When that happens, our 'vlan55' interface, which carries Ceph traffic, also comes up with the wrong MTU, and that causes major problems for Ceph since the other nodes have the proper MTU.
Our eth0 and eth1 interfaces do show the proper MTU. Unfortunately, I did not capture the state of the bond0 interface, as I was in a hurry to fix the issue. Our configuration is identical to Example #2 in the pve wiki for openvswitch, except that we used an MTU of 8996 due to an Intel NIC limitation.
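For context, our /etc/network/interfaces is laid out roughly like this (paraphrased from memory, not a verbatim copy of our config; the bond options and the address are placeholders):

```
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds eth0 eth1
    ovs_options bond_mode=balance-slb
    mtu 8996

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan55
    mtu 8996

allow-vmbr0 vlan55
iface vlan55 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=55
    address 192.0.2.10      # placeholder, not our real Ceph address
    netmask 255.255.255.0
    mtu 8996
```

The point is that every interface in the chain (bond, bridge, and the internal VLAN port) requests MTU 8996, yet after some boots the bridge and vlan55 still show 1500.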
A reboot seems to solve the issue.
We are running Proxmox 4.2 from the no-subscription repository, with packages about two weeks old. Has anyone else hit this issue? We've seen it four times now.
Also, one thing I'm not 100% sure about is whether the MTU problem actually occurs at boot time. A node with only a couple of VMs on it, which hadn't been rebooted in a couple of weeks, started going wonky the other day; upon investigation, its MTU was showing as 1500. I find it hard to believe we wouldn't have noticed the problem earlier, but it is a very low-load cluster, so it's possible. I sure hope it's a boot-time issue and not that the MTU somehow changes over time.
We're adding a Nagios check for this condition, so hopefully we can provide more information the next time it happens.
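The check we're planning is roughly the following sketch (interface names and the expected MTU of 8996 are from our setup; the Nagios-style exit codes are the usual plugin convention):

```shell
#!/bin/sh
# Sketch of a Nagios-style plugin: alert when an interface's MTU
# deviates from the expected value.

check_mtu() {
    iface="$1"
    expected="$2"
    # Read the live MTU from sysfs; empty if the interface is missing.
    actual=$(cat "/sys/class/net/$iface/mtu" 2>/dev/null)
    if [ -z "$actual" ]; then
        echo "UNKNOWN: interface $iface not found"
        return 3
    elif [ "$actual" -ne "$expected" ]; then
        echo "CRITICAL: $iface MTU is $actual, expected $expected"
        return 2
    fi
    echo "OK: $iface MTU is $actual"
    return 0
}

# Check the bridge and the Ceph VLAN port (names from our config;
# "|| :" keeps the loop going if one interface is missing or wrong).
for i in vmbr0 vlan55; do
    check_mtu "$i" 8996 || :
done
```

Run from cron or as a Nagios plugin, this should at least tell us whether the MTU is wrong immediately after boot or drops to 1500 at some later point.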