Connection problems after upgrade to 3.3

cdoering

New Member
Sep 15, 2011
Hello,

we use Proxmox on a 3-node cluster. The nodes are connected through 2 switches (1 connected to the internet and 1 private). We have public IP addresses configured on a bridge as well as a private IP for local communication. Several VMs are running on the cluster. Some of the VMs use public IPs which are routed through the Proxmox nodes (our hoster won't allow us to use a failover subnet on VMs, that is, bind it to a virtual MAC).
These settings worked fine until Monday night, when I upgraded to 3.3.
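For context, the routed part of the setup is nothing exotic; stripped down to the essentials it amounts to something like this on the node that owns the failover subnet (a minimal sketch, not a dump of our actual config):
Code:
# IP forwarding must be on so the node can route the failover subnet to the VMs
sysctl -w net.ipv4.ip_forward=1
# the hoster routes 88.xxx.xxx.72/29 to the node; the VMs in that subnet
# simply use the node's alias address 88.xxx.xxx.73 as their default gateway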

After a test to connect to the services on the VMs succeeded, connection was lost a few hours later to one of the public addresses in a VM behind the Proxmox gateway. We could solve the issue by migrating the VM to the node to which the failover subnet was bound. Tonight, none of the public IPs were accessible and only the IPMI cards in the servers worked (which ruled out the switch). After changing from OVS to Linux bridges, the problem was fixed. A few hours later, another VM was not reachable (a reverse proxy), and only stopping the firewall within the VM (Shorewall) helped. The firewall in the VM was not misconfigured and has worked since the installation; Shorewall on the system that had issues on Monday is still running. Nothing in the Shorewall logs either, except the expected entries...

So we have network issues, and every time a different approach solved them. And they started after the upgrade. Might the new Proxmox firewall affect our networking? I have no other guesses as of now.
The PVE firewall was not configured and I had no other firewall installed on the nodes. In the datacenter settings the firewall was disabled (on the nodes, however, it is enabled, I guess for any rules that might eventually be activated on them), but it was running. Nevertheless, the iptables rules were empty (except a fail2ban rule) and the chain policies were set to ACCEPT.
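For reference, this is roughly how I checked it (standard commands plus the pve-firewall status call; nothing else was customized):
Code:
# firewall service status on the node
pve-firewall status
# dump the live ruleset and chain policies
iptables -S
iptables -L -n -v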

I disabled and stopped the FW on all nodes manually for now.
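In case it matters, this is what I did on each node (the cluster.fw part is, as far as I understand it, the intended way to keep the firewall off cluster-wide; corrections welcome):
Code:
# stop the firewall service on the node
pve-firewall stop

# /etc/pve/firewall/cluster.fw -- keep it disabled for the whole cluster
[OPTIONS]
enable: 0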

Here is the pveversion output:
Code:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-5
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Here is the network config (identical on all nodes except the host IPs):
Code:
# network interface settings
auto lo
iface lo inet loopback

#allow-vmbr0 int0
#iface int0 inet static
auto vmbr0
iface vmbr0 inet static
    address 88.xxx.xxx.66
    netmask 255.255.255.224
    network 88.xxx.xxx.64
    gateway 88.xxx.xxx.65
    #ovs_type OVSIntPort
    #ovs_bridge vmbr0
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0

#iface int1 inet static
iface vmbr0:1 inet static
    address 88.xxx.xxx.73
    netmask 255.255.255.248
    network 88.xxx.xxx.72
    #ovs_type OVSIntPort
    #ovs_bridge vmbr0

allow-vmbr0 eth0
iface eth0 inet manual
#    ovs_type OVSPort
#    ovs_bridge vmbr0

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

allow-vmbr1 bond0
iface bond0 inet manual
    ovs_bonds eth1 eth2 eth3
    ovs_type OVSBond
    ovs_bridge vmbr1
    ovs_options lacp=active bond_mode=balance-tcp

#auto vmbr0
#iface vmbr0 inet static
#    address 10.0.0.1
#    netmask 255.255.255.0
#    network 10.0.0.0
#    ovs_type OVSBridge
#    ovs_ports eth0 int0

auto vmbr1
iface vmbr1 inet static
    address 10.1.0.1
    netmask 255.255.255.0
    network 10.1.0.0
    ovs_type OVSBridge
    ovs_ports bond0

Originally, there were 2 OVS bridges which connect the cluster nodes through the 2 switches. After the loss of connection to all public IPs, I reverted the setup of bridge vmbr0 to a Linux bridge; the commented-out lines are the old settings, which obviously worked prior to the incident. The vmbr0:1 alias (int1 respectively) is for the failover subnet and is started manually or by the fencing script when the subnet is bound to the node (see the sketch below). The VMs with the rest of the public IPs in that subnet use it as their gateway. vmbr1 is the local network (for cluster communication) and is not visible to the VMs themselves.
ovs-vswitchd.log had nothing relevant (prior to my trying to reactivate the network).
VM image storage is a Ceph RBD pool, if this is relevant.
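For completeness, bringing the failover alias up manually is nothing more than the following (a sketch; the fencing script does the equivalent):
Code:
# on the node that currently owns the failover subnet
ifup vmbr0:1
# or directly, without the interfaces stanza
ip addr add 88.xxx.xxx.73/29 dev vmbr0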

If you need some other information, let me know.

Thanks in advance, C. Doering

PS: We have been using Proxmox since 2011 (version 1.7, IIRC) and it has been very reliable so far. Thanks for the great work.
 
