Search results

  1. Corosync KNET Flapping

    So the node got fenced again yesterday and now I'm not seeing errors anymore. There doesn't seem to be anything more in `daemon.log`. It includes the usual flapping followed by what looks like startup logs. Monitoring corosync stats might be a good idea. I'll try and get that worked in there to...
  2. Corosync KNET Flapping

    I've attached the cmap stats for the other nodes here. Node 1 is the node that fenced Node 5. The metrics up there are from Node 5.
  3. Corosync KNET Flapping

    Lucky for me, I have that info. From one of the other nodes: 2023-03-31T12:50:05-04:00 service 'vm:112': state changed from 'fence' to 'recovery' 2023-03-31T12:50:05-04:00 service 'vm:109': state changed from 'fence' to 'recovery' 2023-03-31T12:50:05-04:00 node 'zorya': state changed...
  4. Corosync KNET Flapping

    There are 0 errors on all nodes... There are retries across all nodes but no node is an outlier. In the logs, it always seems to be link 1 that's failing, never link 0. Looking at the logs some more, every node claims that "host 5 joined", host 5 claims every other node joined. Since this node...
  5. Corosync KNET Flapping

    So this cluster has: 1. Intel Core i7-9700k w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 2. Intel Xeon E3-1240L v5 w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 3. Intel Xeon D-1521 w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+...
  6. Corosync KNET Flapping

    The interfaces I have for my corosync rings are not bonded. This just started happening again today and I'm not sure what changed. I don't see any errors so far on the links but I'm seeing constant flapping. If this is related to MAC address aging, why would it have such a weird pattern...
  7. ceph-mgr crash when enabling perf stats

    I'm having some performance issues on CephFS I'm trying to track down. I tried enabling stats following the information on https://docs.ceph.com/en/quincy/cephfs/cephfs-top/. The second I do ceph mgr module enable stats, all of my MDSes start to puke the following: 2023-02-19T21:19:31.164-0500...
  8. Corosync KNET Flapping

    Sorry, I've been on a trip for the past week or so. Anyway, thanks for the tip, I've disabled RSTP and LLDP on ports connected to the cluster nodes. We'll see if that helps.
  9. Corosync KNET Flapping

    Thanks for all of your help! I do have VMs running on vmbr1, those consume ceph client traffic. Not ideal, but as the graph shows, they're not saturated in this situation and the issue is happening on vmbr2.13. vmbr2 is shared between the Ceph backside network and the secondary ring. That...
  10. Corosync KNET Flapping

    The 10 GbE interfaces are not bonded, only the 1 GbE interfaces are bonded and those are used for "public" VM traffic, they're not used for Proxmox clustering or Ceph. They're using LACP and the switch reports that it's fine. Here's an example of one of the host's `/etc/network/interfaces`: ❯...
  11. pvestatd[15200]: status update time

    Not to resurrect an old thread but I'm seeing these errors as well as influxdb read timeouts but `time pvesm status` always returns in less than a second: ❯ time pvesm status Name Type Status Total Used Available % cluster...
  12. Corosync KNET Flapping

    Every so often (like right now), I'll start seeing a lot of KNET logs about a link going down and coming back up. Sometimes rebooting one node will fix it, sometimes it won't. It seems to happen randomly after node reboots or some other event. How can I determine which node is causing this or...
  13. [SOLVED] No perf counter for schema for mon.#

    So this either went away after I finished upgrading to Quincy or after I restarted some combination of daemons as part of the upgrade.
  14. [SOLVED] No perf counter for schema for mon.#

    In preparation for upgrading to Ceph Quincy (from Pacific), I had to remove and re-add three of my monitors as they were still using leveldb. Doing this through the GUI didn't work as it was somehow unable to find monitors with that ID. These are fairly old monitors so I ended up having to do...
  15. What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

    Update: So one of two things made this go away: 1. Rebooting all the nodes. 2. Migrating a VM off of one of the nodes. When doing the live migration of the VM, it was going absurdly slowly, like 56k over a 10GbE link with no other traffic. After migrating that VM and rebooting, all is well...
  16. What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

    Right I did think that Ceph was just saturating the link but after the rebalancing stopped and I started seeing very little traffic on the 10g link...that's when I got suspicious. I'll do some digging on how to read that stats output but if you could help me interpret it, that would be helpful...
  17. What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

    We had a power outage yesterday. For unrelated reasons, 4/5 of the nodes did not power up successfully. As I powered them up, I would fix each node, which added maybe 5 minutes to each node's bootup. After this, I started seeing these messages constantly on all nodes but _only_ for the secondary...
  18. [SOLVED] Update 7.0-11 to 7.1-8 Ceph issues

    So this is now solved, I think. According to my error log, it seems like, somehow, an MDS was reporting an older version than 16.2.7. Checking the CephFS section in the Proxmox web UI, I actually had 2 MDSes that were reporting _no_ version string at all. I first did ceph mds fail...
  19. [SOLVED] Update 7.0-11 to 7.1-8 Ceph issues

    I've posted a message to the ceph mailing list here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/FU6JPZNLY2PVF4ZV7PYP2KDJ4UFSVOR2/
  20. [SOLVED] Update 7.0-11 to 7.1-8 Ceph issues

    Yup, I can confirm. Tried to turn it off this morning and I still get assertion failures. I wonder if I can delete and re-create each mon one at a time...
