pvecm & ceph seem happy, but still no quorum in GUI - how do I resync?

athompso · Jul 15, 2015

I have 9 nodes in a cluster: pve1 through pve9.

"pvecm nodes" on all members show all 9 members as "Members".
"pvecm status" on all members show the membership as "Cluster-Member".
"ceph -s" shows the 9-member CEPH cluster as being healthy.

In the GUI and in pvesh, however, I see a different story.
pve1,2,4,6,7 & 8 show pve1,2,4,6,7 & 8 as online (green) and pve3,5,9 as offline (red).
pve3, 5 and 9 all show *only themselves* as online in the GUI.

All the VMs that are running continue to run normally, but now I can't migrate VMs, among other things.

So... firstly, what process controls whether the GUI thinks a member is in the cluster or not? Obviously pvecm and the GUI have two completely different views of reality.
Secondly, how do I cause the two views of reality to re-sync?

I'm reluctant to reboot the offending nodes, because usually rebooting a node is what *causes* this situation, and I have reboot every single node in the cluster to fully recover. Or sometimes it just recovers spontaneously on its own overnight.

Obviously this is a bug or defect somewhere - but is it in in pvecm, the GUI, or what? Or is there some intermediate layer that has no visibility whatsoever?

I have completely disabled IGMP; I can even see the IGMP packets at my workstation - which, obviously is not ideal, but that's what I had to do to make the cluster work reliably.

There is ONE anomaly I can see... "pvecm nodes", while showing all nodes as online and members, does not always agree on the "Inc" column. But I can't find any documentation telling me what that value means. (The value for the node where I'm running it is always dramatically lower than the others... but that's even on the subset of nodes that are still talking to each other properly.)

Help!

Thanks,
-Adam Thompson
athompso@athompso.net

soholingo · Jul 15, 2015

Excellent question as I have seen this behaviour with a three node setup.

wolfgang · Jul 15, 2015

Hi,
try to restart follow services on the red nodes.
pvestatd
pvedaemon
pveproxy

athompso · Jul 15, 2015

wolfgang said:
Hi,
try to restart follow services on the red nodes.
pvestatd
pvedaemon
pveproxy

Done. No visible effect.

athompso · Jul 15, 2015

Restarting those services on *all* cluster nodes resulted in: no change at all. The same 3 nodes are ~offline to PVE, but not to pvecm or ceph or anything else.

mir · Jul 15, 2015

Are the clocks synchronized on your servers?

athompso · Jul 15, 2015

One other possible clue... in /var/log/debug, there's an endless stream of UDP checksum error messages on pve{1,2,3,4,6} but not on the others. The network on each is configured identically. Pve{1,2,3,4} are identical hardware [Dell C6100], pve{5,6} are identical hardware [Dell R710], and pve {8,9} are identical hardware [Dell PE2950]. I haven't played around with NIC offloading, it's still whatever the default is.

pve1 reports bad csum for pve2 & pve4.
pve2 reports bad csum for pve4 only.
pve3 reports bad csum for pve2 only.
pve4 reports bad csum for pve1, pve2 & pve5.
pve6 reports bad csum for pve2 & pve4.

WTF????

/var/log/messages shows that yet a different subset of hosts are sitting spinning on corosync retransmit lists.

athompso · Jul 15, 2015

Oh, no... this looks like it might be related to http://openvswitch.org/pipermail/discuss/2014-May/013856.html, since I am using OVSIntPorts.

Code:

root@pve1:/var/log# cat /etc/network/interfaces 
# network interface settings
allow-vmbr0 mgmt158
iface mgmt158 inet static
        address  192.168.158.31
        netmask  255.255.255.0
        gateway  192.168.158.1
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=158

allow-vmbr0 ceph157
iface ceph157 inet static
        address  192.168.157.31
        netmask  255.255.255.0
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=157

auto lo
iface lo inet loopback

iface eth1 inet manual

iface eth0 inet manual

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bonds eth0 eth1
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options lacp=active bond_mode=balance-tcp

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 ceph157 mgmt158

and

Code:

root@pve1:/var/log# ovs-vsctl show
0e3b4265-911c-4e8f-9fae-5a19ab628828
    Bridge "vmbr0"
        Port "tap102i0"
            tag: 158
            Interface "tap102i0"
        Port "tap104i0"
            Interface "tap104i0"
        Port "vmbr0"
            Interface "vmbr0"
                type: internal
        Port "tap117i0"
            Interface "tap117i0"
        Port "mgmt158"
            tag: 158
            Interface "mgmt158"
                type: internal
        Port "tap101i0"
            tag: 158
            Interface "tap101i0"
        Port "ceph157"
            tag: 157
            Interface "ceph157"
                type: internal
        Port "tap100i0"
            Interface "tap100i0"
        Port "bond0"
            Interface "eth0"
            Interface "eth1"
    ovs_version: "2.3.1"

athompso · Jul 15, 2015

I've only found one other possible reference to this problem so far, from http://download.openvz.org/kernel/branches/rhel6-2.6.32-testing/042stab016.1/kernel.spec

* Wed Jul 14 2010 Aristeu Rozanski <arozansk@redhat.com> [2.6.32-49.el6]
- [edac] i7core_edac: Avoid doing multiple probes for the same card (Mauro Carvalho Chehab) [604564]
- [edac] i7core_edac: Properly discover the first QPI device (Mauro Carvalho Chehab) [604564]
- [usb] Disable XHCI (USB 3) HCD module autoloading (Matthew Garrett) [608343]
- [fs] xfs: prevent swapext from operating on write-only files (Jiri Pirko) [605162] {CVE-2010-2226}
- [powerpc] Add symbols to kernel to allow makedumpfile to filter on ppc64 (Neil Horman) [611710]
- [net] netfilter: add CHECKSUM target (Michael S. Tsirkin) [605555]
- [security] audit: dynamically allocate audit_names when not enough space is in the names array (Eric Paris) [586108]
- [pci] iommu/intel: Disable IOMMU for graphics if BIOS is broken (Adam Jackson) [593516]
- [virt] stop vpit before irq_routing freed (Gleb Natapov) [612648]
- [netdrv] Allow for BCM5709S to dump vmcore via NFS (John Feeney) [577809]
- [netdrv] igb: drop support for UDP hashing w/ RSS (Stefan Assmann) [613782]
- [netdrv] mac80211: remove wep dependency (John Linville) [608704]
- [mm] fix swapin race conditions (Andrea Arcangeli) [606131]
- [crypto] authenc: Add EINPROGRESS check (Stanislaw Gruszka) [604611]
- [fs] inotify: don't leak user struct on inotify release (Stanislaw Gruszka) [592399 604611]
- [x86] amd: Check X86_FEATURE_OSVW bit before accessing OSVW MSRs (Stanislaw Gruszka) [604611]
- [kernel] profile: fix stats and data leakage (Stanislaw Gruszka) [604611]
- [sound] ice1724: Fix ESI Maya44 capture source control (Stanislaw Gruszka) [604611]
- [mm] hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of OOM-killer (Stanislaw Gruszka) [604611]
- [dma] dma-mapping: fix dma_sync_single_range_* (Stanislaw Gruszka) [604611]
- [hwmon] hp_accel: fix race in device removal (Stanislaw Gruszka) [604611]
- [net] ipv4: udp: fix short packet and bad checksum logging (Stanislaw Gruszka) [604611]

athompso · Jul 15, 2015

Yes. They're all pointing to a common, local stratum-2 NTP server... oh, wait, they're not. They're still within 0.1sec, since Ceph isn't complaining incessantly, but I'll go fix that now.

mir · Jul 15, 2015

athompso said:
One other possible clue... in /var/log/debug, there's an endless stream of UDP checksum error messages on pve{1,2,3,4,6} but not on the others. The network on each is configured identically. Pve{1,2,3,4} are identical hardware [Dell C6100], pve{5,6} are identical hardware [Dell R710], and pve {8,9} are identical hardware [Dell PE2950]. I haven't played around with NIC offloading, it's still whatever the default is.

pve1 reports bad csum for pve2 & pve4.
pve2 reports bad csum for pve4 only.
pve3 reports bad csum for pve2 only.
pve4 reports bad csum for pve1, pve2 & pve5.
pve6 reports bad csum for pve2 & pve4.

This could indicate nic problems. For starters I would suggest that you disable hardware checksum offloading. Use ethtool for that.

athompso · Jul 15, 2015

Really? Intermittent NIC problems on several nodes simultaneously? That all come and go at about the same times? I'd be more inclined to look for problems on the switch. However, based on the OVS multicast thread that Dietmar started last year (linked, above), I'm guessing it's an OVS/Kernel/Driver bug.

Also, I just rebooted pve1, pve5, pve7 and pve9 (slowly, in sequence) and suddenly I have quorum in the GUI again and can manage the cluster. Which is good, because rebooting pve2 or pve3 would have shut down the entire organization for an hour.

Like I said, sometimes it just magically comes back to life...

I will try fiddling with the NIC offload settings, though, and see if that makes any difference.

mir · Jul 15, 2015

But if there is a driver problem you should see on several nodes. Driver problems with nics usually influence hardware acceleration which offloading requires.

athompso · Jul 15, 2015

I've turned off ALL possible offloading on the NICs, and am still seeing the UDP checksums. Again, based on Proxmox's (Dietmar's) own discoveries with OVS and OVSIntPorts, I'm comfortable blaming OVS+OVSIntPort+Kernel+Driver+Multicast+UDP for the checksum problem.

The question then becomes, is that OVSIntPort UDP multicast checksum problem the root cause of (or contributing to) the PVE issues? Or is it something else, still?

athompso · Jul 16, 2015

I can confirm that this is at least part of the problem.
See http://openvswitch.org/pipermail/discuss/2015-July/018242.html for more details.

So:
PVE + OVS + OVSIntPort + cluster communications on that OVSIntPort + multicast + UDP + *running* VM with tap-mode network interface = problems.

Wow... I have a colleague who attracts corner cases like a magnet, now I know how he feels

.

soholingo · Jul 16, 2015

Thanks for sharing that answer. I have a server with 4 1 gig nics that I was about to bond into one link, and use OVS to segment the traffic of ceph/cluster/VM.
Looks like I will need a dedicated port for each. I can use OVS for the VM environment...

Thanks again!

athompso · Jul 17, 2015

It looks like, based on Dietmar's original OVS bugreport, that having the management port on the UNTAGGED bridge itself, rather than on an OVSIntPort, might work fine. I'll be testing that here soon, but not today!

soholingo · Jul 17, 2015

That makes sense. But my thought would be to build the bridge w/o an IP and then use a tagged OVSintPort to do the mgmt, but the more I think about it I like the idea of putting IP on the bridge, and leaving all the communication to the VMS to be managed by the OVSintPorts and tagged...

athompso · Jul 21, 2015

The problem with not using OVSIntPort for proxmox management is when you have VMs that run in the same VLAN/subnet as the proxmox hosts themselves... then you have to keep track of which VLAN is the untagged VLAN. I'm scared to use that VLAN Tag port on the vmbr0 interface, because I have no idea how it affects bridge operation, and I don't have a test lab right now.

athompso · Jul 22, 2015

Well, that bites... I switched management to vmbr0 itself instead of an OVSIntPort, and the problems continue. Thus my results contradict Dietmar's original problem report, in that it's not tied to an OVSIntPort.
It *is* still tied to having a running tap interface attached to vmbr0, however.

I'll be reverting all the way back to Linux Bridging, I think, because my cluster being this fragile can't be tolerated.

pvecm & ceph seem happy, but still no quorum in GUI - how do I resync?

Member

New Member

Proxmox Retired Staff

Member

Member

Famous Member

Member

Member

Member

Member

Famous Member

Member

Famous Member

Member

Member

New Member

Member

New Member

Member

Member