Search results

  1. download.proxmox.com unreachable

    Hi, we've noticed increasingly frequent occasions where we are unable to retrieve updates from download.proxmox.com from South Africa. We get directed to af.cdn.proxmox.com and are unable to establish a connection to the resulting IP on TCP port 80 (HTTP). [admin@backup1 ~]# host download.proxmox.com...
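
    A quick way to confirm the symptom described above is to resolve the names involved and then test TCP port 80 directly. This is only an illustrative sketch using standard tools (host, nc, curl), not the exact commands from the post:

      # Resolve the download host and the regional CDN edge it redirects to
      host download.proxmox.com
      host af.cdn.proxmox.com

      # Test raw TCP connectivity to port 80 with a 5 second timeout
      nc -vz -w 5 af.cdn.proxmox.com 80

      # Or attempt an actual HTTP request with a short connect timeout
      curl -v --connect-timeout 5 -o /dev/null http://download.proxmox.com/
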
  2. OVS Bridge Between VMs (QinQ?)

    We've been running QinQ VMs for almost 2 years, i.e. virtual routers or firewalls carrying multiple VLANs over a single virtual Ethernet uplink assigned to the VM. PVE 6 only requires a very simple change to the network initialization script, details here...
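
    The network change referenced above is not shown in the snippet. As a rough illustration only (the port name tap100i0 and the outer tag 100 are made up, and this is not necessarily the change the post refers to), Open vSwitch can push an outer service tag onto a VM-facing port while preserving the VM's inner VLAN tags by putting the port into dot1q-tunnel mode:

      # Hypothetical example: wrap all frames from the VM-facing port tap100i0
      # in outer VLAN 100, keeping the VM's own inner VLAN tags intact (QinQ)
      ovs-vsctl set port tap100i0 vlan_mode=dot1q-tunnel tag=100
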
  3. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Hi, we run Corosync on the VLAN Ceph replicates on, over a redundant LACP channel, instead of on a dedicated NIC. False-positive fencing events are extremely disruptive, so we continue to run with those settings in place...
  4. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    I'm also able to report 7 healthy clusters with zero false-positive fencing events over the last week. We always configure Corosync to run on LACP OvS bonds, so the changes @spirit recommended are perfect for our use case (detailed here). The cluster where nodes would get fenced regularly (the...
  5. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Ultimately, in our environments, it reduces false-positive events. We typically have VMs bridged on a dedicated LAG, with Ceph and Corosync on another LAG. Failover for real failure events takes around 2 minutes, but unnecessarily fencing nodes is massively disruptive...
  6. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    @spirit Good thinking. I scanned through the documentation on Corosync 3 and my understanding is that the token timeout is automatically adjusted by the token coefficient when there are 3 or more nodes, so I made the following changes. On all three nodes initially: systemctl stop pve-ha-lrm...
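
    For context, the truncated steps above follow the usual pattern: stop the HA services so a Corosync restart cannot trigger fencing, raise the totem token timeout in /etc/pve/corosync.conf (incrementing config_version in the same edit), then restart Corosync and bring HA back. The sketch below is an approximation reconstructed from the snippet, not the post's full command list; note that with 3 or more nodes Corosync adds token_coefficient (650 ms per node beyond two, by default) on top of the configured token value.

      # On every node first, so restarting corosync cannot trigger self-fencing
      systemctl stop pve-ha-lrm
      systemctl stop pve-ha-crm

      # Edit the cluster-wide config: increment config_version and raise the
      # token timeout inside the totem { } section, e.g. token: 10000
      pico /etc/pve/corosync.conf

      # Restart corosync on each node, then re-enable HA
      systemctl restart corosync
      systemctl start pve-ha-crm
      systemctl start pve-ha-lrm
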
  7. [SOLVED] pveproxy fails to load local certificate chain after upgrade to pve 6

    Was upgrading a standalone PVE 5 to 6 today and ran into this... To fix: rm -f /etc/pve/pve-root-ca.pem /etc/pve/priv/pve-root-ca.* /etc/pve/local/pve-ssl.*; pvecm updatecerts -f;
  8. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    We have two clusters in which we host virtual routers and firewalls. Heavy network traffic causes jitter and sometimes even packet loss with the default LACP OvS configuration, so we run a sort of hybrid. The root cause is that Intel X520 network cards support receive side steering, where they...
  9. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    We increased the totem timeout to 10000 and yes, we're running libknet 1.11-pve2. We haven't had a false-positive fencing scenario since the 13th, but the logs indicate continuing problems, so I assume it's a matter of time... Symptoms appear very similar, in that relatively minor network...
  10. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Those 3 nodes are in a client's cluster using older equipment. They were previously 3 x VMware hosts using an HP SAN device that the IT guy's predecessor had set up as RAID-0... Regardless, their Proxmox + Ceph cluster has been super stable for the last 2 years but plagued by frequent...
  11. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    I upgraded libknet to 1.11-pve2 and restarted corosync with debugging enabled. Was 'lucky' to have fairly recently started concurrent pings between all three nodes when a corosync host down event occurred. I hope there's something useful in these logs. kvm1 = 1.1.7.9 kvm2 = 1.1.7.10 kvm3 =...
  12. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    The following could perhaps be a separate but related bug. These logs are from a remaining node when one of the nodes dropped off. Cluster membership updates successfully to 5/6, but notification messages then start escalating, with processing pauses being reported. It eventually settles, but I'm sure...
  13. [SOLVED] Ceph 14.2.3 / 14.2.4

    Apologies, I fat-fingered something. For others, switch /etc/apt/sources.list.d/ceph.list to the test repository and upgrade:

      pico /etc/apt/sources.list.d/ceph.list
      #deb http://download.proxmox.com/debian/ceph-nautilus buster main
      deb http://download.proxmox.com/debian/ceph-nautilus buster test

      apt-get update; apt-get dist-upgrade; systemctl restart ceph.target...
  14. [SOLVED] Ceph 14.2.3 / 14.2.4

    Great news. I just checked, though, and it unfortunately doesn't appear to have replicated to the mirrors yet... For those affected by this: one can generally mitigate by restarting the downed Ceph OSD processes. Systemd would have tried this numerous times and failed though, so the...
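
    The snippet above is cut off before the actual commands. As an illustrative sketch only (the OSD id 12 below is made up), restarting an OSD that systemd has already given up on usually means clearing the failed state first:

      # See which OSDs are currently down
      ceph osd tree

      # systemd stops retrying once the start limit is hit, so clear the failed
      # state before starting the OSD again
      systemctl reset-failed ceph-osd@12.service
      systemctl start ceph-osd@12.service
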
  15. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Just a heads-up that the second filter also matches only protocol 1 (ICMP)...
  16. [SOLVED] Intel S5520HC Xeon reboots

    We repurposed some old hardware to set up a proper sandbox environment. The cluster has 4 Dell R620 servers and a relatively old Intel S5520HC system with Intel E5620 (Westmere) CPUs. This last node isn't going to be used for VMs, primarily serving as a dedicated Ceph storage node. System...
  17. [SOLVED] Ceph 14.2.3 / 14.2.4

    We are being affected by OSDs failing when another node is restarted. The issue is detailed in Ceph bug tracker entry 39693 (https://tracker.ceph.com/issues/39693). The issue has apparently been addressed and included in Ceph 14.2.3; will there be binaries in the testing repository soon...
  18. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Nothing at all, only the usual boot-time initialisation messages. The QLogic firmware might be newer/older than others':

      [root@kvm1 ~]# ethtool -i eth0
      driver: bnx2
      version: 2.2.6
      firmware-version: bc 5.2.3 NCSI 2.0.6
      expansion-rom-version:
      bus-info: 0000:03:00.0
      supports-statistics: yes...
  19. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    I don't believe so; Corosync hasn't crashed on any of our nodes, and switching to udpu made no difference, so we're back on knet. Our small HP system cluster, which has bnx2 NICs, is the only one still experiencing regular problems...