Hello,
we use Proxmox on a 3-node cluster. The nodes are connected through 2 switches (one connected to the internet and one private). We have public IP addresses configured on a bridge as well as a private IP for local communication. Several VMs run on the cluster. Some of the VMs use public IPs which are routed through the Proxmox nodes (our hoster won't allow us to use a failover subnet directly on VMs, i.e. to bind it to a virtual MAC).
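To illustrate what "routed through the nodes" means here (a minimal sketch, not our hoster's exact setup; the .74 address is just an example VM IP from the failover subnet):
Code:
# forwarding must be enabled on the node so it can act as gateway for the failover subnet
sysctl -w net.ipv4.ip_forward=1
# the vmbr0:1 alias (88.xxx.xxx.73/29) provides the connected route for the subnet,
# so traffic for e.g. a VM at 88.xxx.xxx.74 is forwarded out via vmbr0
ip route show dev vmbr0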
These settings worked fine until Monday night, when I upgraded to 3.3.
A test connection to the services on the VMs succeeded right after the upgrade, but a few hours later we lost the connection to one of the public addresses of a VM behind the Proxmox gateway. We could solve that by migrating the VM to the node to which the failover subnet was bound. Tonight, none of the public IPs were reachable any more and only the IPMI cards in the servers still responded (which ruled out the switch). After changing vmbr0 from OVS to a Linux bridge, the problem was fixed. A few hours later, another VM (a reverse proxy) became unreachable, and only stopping the firewall inside the VM (Shorewall) helped. That firewall is not misconfigured and has worked since the installation; Shorewall on the system that had the issues on Monday is still running. There is nothing in the Shorewall logs either, except the expected entries...
So we have network issues, and every time a different approach solved them. They started right after the upgrade. Might the new Proxmox firewall affect our networking? I have no other guesses as of now.
The PVE firewall was not configured and I had no other firewall installed on the nodes. In the datacenter settings the firewall was disabled (on the nodes, however, it is enabled, I guess so that rules can be activated on them if needed), but the service was running. However, the iptables chains were empty (except a fail2ban rule) and the policies were set to ACCEPT.
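For reference, this is roughly how I checked the state on the nodes (a minimal sketch; output obviously differs per node):
Code:
# show what the PVE firewall service thinks it is doing
pve-firewall status
# dump the current iptables rules and chain policies
iptables-save
iptables -L -n -v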
For now, I have manually disabled and stopped the firewall on all nodes.
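Roughly what I did on each node (a sketch from memory; the datacenter-level option was already off):
Code:
# remove the PVE-generated iptables rules
pve-firewall stop
# and stop the service itself until this is understood
service pve-firewall stop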
Here is the pveversion output:
Code:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-5
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
Here is the network config (identical on all nodes except the host IPs):
Code:
# network interface settings
auto lo
iface lo inet loopback

#allow-vmbr0 int0
#iface int0 inet static
auto vmbr0
iface vmbr0 inet static
    address 88.xxx.xxx.66
    netmask 255.255.255.224
    network 88.xxx.xxx.64
    gateway 88.xxx.xxx.65
#    ovs_type OVSIntPort
#    ovs_bridge vmbr0
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0

#iface int1 inet static
iface vmbr0:1 inet static
    address 88.xxx.xxx.73
    netmask 255.255.255.248
    network 88.xxx.xxx.72
#    ovs_type OVSIntPort
#    ovs_bridge vmbr0

allow-vmbr0 eth0
iface eth0 inet manual
#    ovs_type OVSPort
#    ovs_bridge vmbr0

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

allow-vmbr1 bond0
iface bond0 inet manual
    ovs_bonds eth1 eth2 eth3
    ovs_type OVSBond
    ovs_bridge vmbr1
    ovs_options lacp=active bond_mode=balance-tcp

#auto vmbr0
#iface vmbr0 inet static
#    address 10.0.0.1
#    netmask 255.255.255.0
#    network 10.0.0.0
#    ovs_type OVSBridge
#    ovs_ports eth0 int0

auto vmbr1
iface vmbr1 inet static
    address 10.1.0.1
    netmask 255.255.255.0
    network 10.1.0.0
    ovs_type OVSBridge
    ovs_ports bond0
Originally, there were 2 OVS bridges connecting the cluster nodes through the 2 switches. After the loss of connectivity to all public IPs, I reverted the bridge vmbr0 to a Linux bridge. The commented-out lines are the old settings, which obviously worked prior to the incident. The vmbr0:1 alias (int1, respectively) is for the failover subnet and is started manually or by the fencing script when the subnet is bound to that node. The VMs with the rest of the public IPs in the subnet use this address as their gateway. vmbr1 is the local network (for cluster communication) and is not seen by the VMs at all.
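To illustrate (a sketch, not the exact fencing script): when the failover subnet moves to a node, the alias is brought up like this:
Code:
# bring up the gateway address for the failover subnet on this node
ifup vmbr0:1
# or, equivalently, by hand:
ip addr add 88.xxx.xxx.73/29 dev vmbr0 label vmbr0:1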
ovs-vswitchd.log contained nothing relevant (prior to my attempts to reactivate the network).
VM image storage is a Ceph RBD pool, in case this is relevant.
If you need some other information, let me know.
Thanks in advance, C. Doering
PS: We have been using Proxmox since 2011 (starting with version 1.7, IIRC) and it has been very reliable so far. Thanks for the great work.