Plugged proxmox server into switch - breaks other heartbeat servers

squeeb · Aug 6, 2012

Hi Guys,

I have a very strange issue here that hopefully some of you know the answer to.

I have 2 x proxmox 2.1 servers.
They have two interfaces each, eth0 is bridged to vmbr0 and eth1 is just a peer to peer link to the other proxmox server via crossover cable.

Server 1:
Hostname vz1
eth0: raw interface, no IP
vmbr0: 192.168.1.1/24
eth1: 10.99.99.1/30

Server 2:
Hostname vz2
eth0: raw interface, no IP
vmbr0: 192.168.1.2/24
eth1: 10.99.99.2/30

they work fine on their own switch, but i moved them into our datacenter with a couple of servers running Heartbeat v1 providing high availability NFS.

Fileserver 1:
Hostname: NFS1
eth0: 192.168.1.10/24
eth0:1 192.168.1.20/24 (floating IP, heartbeat configured)
eth1: 10.99.99.1/30

Fileserver 2:
Hostname: NFS2
eth0: 192.168.1.11/24
eth1: 10.99.99.2/30

The two file servers are in multicast group 239.0.0.1 on eth0

As soon as I plug either proxmox box into the same network as the file servers, Heartbeat on the file servers says it lost connectivity to the other fileserver and tries to take over the resources (Both of them do this at the same time and all manor of hell occurs).

as soon as I remove the proxmox servers from the network physically, heartbeat on the file servers resumes normally and everything is fine.

Here are the steps I have tried so far:

* Reconfigured proxmox's cluster multicast IP to 239.0.0.2 on eth1 so it's completely separate from heartbeat on the file servers
* Stopped cman on both proxmox servers
* Plugged one proxmox server in at a time
* Removed IP addresses on both proxmox server's vmbr0 interface and plugged them in one at a time.

With each of the above steps, heartbeat on the file servers decided it had lost contact with it's other partner and tried to assume the resources.

I have run out of ideas here.

I have a hunch it's something to do with the bridge interface, perhaps interfering with multicast traffic.

Initially the switch they were plugged into (A Cisco 3560) was configured so that their ports were vlan trunks with their native vlan set to the 192.168.1.0/24 subnet.
I set the ports to access ports instead to no avail.
I also disabled spanning tree on those ports to see if that was the issue, still no avail.

Can somebody shed some light on this issue? I would be very grateful

Regards,
Squeeb

e100 · Aug 7, 2012

I run some heartbeat servers on same network as Proxmox and heartbeat VMs in Proxmox, never had a conflict like that.

Just a wild guess, but maybe there is a conflicting MAC address on the Proxmox nodes?

squeeb · Aug 7, 2012

Definitely not that, I one of the first things I checked.

squeeb · Aug 7, 2012

Interestingly, both of the file servers can ping each other constantly while the failure is occurring - this would indicate it's a problem with multicast traffic on the subnet being modified or altered by the proxmox boxes when they are plugged into the network, I'm sure it's something to do with the bridged interface but I have no idea how to test this.

squeeb · Aug 7, 2012

Ok, some more progress.

I set up a couple of old Dell 1850's in a DRBD / Heartbeat v1 / NFS configuration just like we do at the datacenter and they worked fine.

I then booted up a single Proxmox 2.1 server and attached it to the same switch on the same vlan / subnet and watched the HA log files on the drbd servers.

As soon as the interface became live on the proxmox servers, both HA servers flipped out thinking that their respective partner had disconnected. .

INTERESTINGLY though, after a couple of minutes they returned to a active/standby mode as they should do under normal circumstances.

I'm sure it's something to do with the bridge configuration.. nnnngggg!!!

tom · Aug 8, 2012

do you use identical IP multicast addresses somewhere?

squeeb · Aug 9, 2012

Not on this subnet no. The HA file servers are using eth0,239.0.0.1

I configured /etc/pve/cluster.conf to use 239.0.0.2

I verified that the new multicast address was being used by restarting and then doing netstat -lpnu and sure enough, corosync was listening on 239.0.0.2

However this didn't help and the problem occurred even when corosync isn't running and no multicast address is present in netstat.

tom · Aug 10, 2012

switch to unicast to check if its related to IP multicast (see http://pve.proxmox.com/wiki/Multicast_notes#Use_unicast_instead_of_multicast)

Search

Search

Plugged proxmox server into switch - breaks other heartbeat servers

squeeb

New Member

e100

Renowned Member

squeeb

New Member

squeeb

New Member

squeeb

New Member

tom

Proxmox Staff Member

squeeb

New Member

tom

Proxmox Staff Member