Recent content by ikogan

  1. Some client(s) are unable to authenticate to Ceph but which ones?

    This past weekend I completed migrating my config management to Ansible and updated all PVE packages, rebooting each node as I did so. Afterwards, 4 of my 5 nodes started throwing errors that some Ceph client(s) using the client.admin identity are unable to authenticate...
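    In case it helps narrow this down, the approach I'm trying is to turn up auth logging on the monitors and watch where the failed attempts come from. A rough sketch (the debug levels and log path are assumptions on my part):

        # Temporarily raise auth logging on the monitors, then revert when done
        ceph config set mon debug_auth 5/5
        ceph config set mon debug_ms 1

        # Watch the monitor logs for cephx failures and note the source IPs
        tail -f /var/log/ceph/ceph-mon.*.log | grep -iE 'cephx|auth'

        # Revert once the offending client(s) are identified
        ceph config rm mon debug_auth
        ceph config rm mon debug_ms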
  2. Proxmox Datacenter Manager - First Alpha Release

    Is there any reason why this couldn't be placed in a docker image? Additionally, how do y'all feel about releasing an official image?
  3. Installing "ceph-exporter" Daemon

    No, I'm just taking the performance hit for now.
  4. Installing "ceph-exporter" Daemon

    According to the Ceph documentation, at least as of Reef, the mgrs no longer export perf counters by default (https://docs.ceph.com/en/reef/mgr/prometheus/#id1), which I thought wouldn't be a big deal for me. However, some of these counters include OSD storage information, in particular...
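    For anyone else hitting this, the linked Reef page also describes turning the old behaviour back on instead of running ceph-exporter; a minimal sketch, assuming I'm reading the option name there correctly:

        # Re-enable per-daemon perf counters in the mgr's prometheus module
        # (this carries the scaling caveat the Reef docs mention)
        ceph config set mgr mgr/prometheus/exclude_perf_counters false

        # Confirm the counters are being exported again (9283 is the module's default port)
        curl -s http://<active-mgr>:9283/metrics | grep -m5 ceph_osd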
  5. Corosync KNET Flapping

    So the node got fenced again yesterday and now I'm not seeing errors anymore. There doesn't seem to be anything more in `daemon.log`. It includes the usual flapping followed by what looks like startup logs. Monitoring corosync stats might be a good idea. I'll try and get that worked in there to...
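    What I have in mind for the monitoring is just a periodic dump of the stats map, something along these lines (output path and interval are arbitrary):

        # Snapshot corosync's stats map and link status every minute so there's
        # something to compare against the next time a node gets fenced
        mkdir -p /var/log/corosync-stats
        while true; do
            ts=$(date -Is)
            corosync-cmapctl -m stats > "/var/log/corosync-stats/${ts}.txt"
            corosync-cfgtool -s >> "/var/log/corosync-stats/${ts}.txt"
            sleep 60
        done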
  6. Corosync KNET Flapping

    I've attached the cmap stats for the other nodes here. Node 1 is the node that fenced Node 5. The metrics up there are from Node 5.
  7. Corosync KNET Flapping

    Lucky for me, I have that info. From one of the other nodes:
    2023-03-31T12:50:05-04:00 service 'vm:112': state changed from 'fence' to 'recovery'
    2023-03-31T12:50:05-04:00 service 'vm:109': state changed from 'fence' to 'recovery'
    2023-03-31T12:50:05-04:00 node 'zorya': state changed...
  8. Corosync KNET Flapping

    There are 0 errors on all nodes... There are retries across all nodes, but no node is an outlier. In the logs, it always seems to be link 1 that's failing, never link 0. Looking at the logs some more, every node claims that "host 5 joined", while host 5 claims every other node joined. Since this node...
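    For reference, this is roughly how I'm pulling the per-link numbers (stats key names may vary a bit between corosync versions):

        # Link state as corosync sees it, per host and per link (link 0 vs link 1)
        corosync-cfgtool -s

        # Retransmit/error/down counters for each link from the stats map
        corosync-cmapctl -m stats | grep -E 'link[01]\.' | grep -iE 'retr|error|down|connected'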
  9. Corosync KNET Flapping

    So this cluster has:
    1. Intel Core i7-9700k w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
    2. Intel Xeon E3-1240L v5 w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
    3. Intel Xeon D-1521 w/Intel Corporation 82599ES 10-Gigabit SFI/SFP+...
  10. Corosync KNET Flapping

    The interfaces I have for my corosync rings are not bonded. This just started happening again today and I'm not sure what changed. I don't see any errors so far on the links, but I'm seeing constant flapping. If this is related to MAC address aging, why would it have such a weird pattern...
  11. ceph-mgr crash when enabling perf stats

    I'm having some performance issues on CephFS that I'm trying to track down. I tried enabling stats following the information on https://docs.ceph.com/en/quincy/cephfs/cephfs-top/. The second I do `ceph mgr module enable stats`, all of my MDSes start to puke the following:
    2023-02-19T21:19:31.164-0500...
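    For context, these are the steps from that page I was following when it blew up (client.fstop and its caps are what the cephfs-top docs use):

        # Enable the mgr stats module that cephfs-top depends on
        ceph mgr module enable stats

        # Create the client cephfs-top expects by default, then run it
        ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r'
        cephfs-top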
  12. Corosync KNET Flapping

    Sorry, I've been on a trip for the past week or so. Anyway, thanks for the tip, I've disabled RSTP and LLDP on ports connected to the cluster nodes. We'll see if that helps.
  13. Corosync KNET Flapping

    Thanks for all of your help! I do have VMs running on vmbr1; those consume Ceph client traffic. Not ideal, but as the graph shows, they're not saturated in this situation and the issue is happening on vmbr2.13. vmbr2 is shared between the Ceph backside network and the secondary ring. That...
  14. Corosync KNET Flapping

    The 10 GbE interfaces are not bonded; only the 1 GbE interfaces are bonded, and those are used for "public" VM traffic, not for Proxmox clustering or Ceph. They're using LACP and the switch reports that it's fine. Here's an example of one of the host's `/etc/network/interfaces`: ❯...
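    Roughly the shape of that file, with made-up interface names and addresses rather than my actual config:

        # /etc/network/interfaces (illustrative only)
        # 1 GbE pair: LACP bond, public VM traffic only
        auto bond0
        iface bond0 inet manual
            bond-slaves eno1 eno2
            bond-mode 802.3ad
            bond-miimon 100

        auto vmbr0
        iface vmbr0 inet static
            address 192.0.2.10/24
            gateway 192.0.2.1
            bridge-ports bond0
            bridge-stp off
            bridge-fd 0

        # 10 GbE, unbonded: Ceph client traffic + corosync ring 0
        auto vmbr1
        iface vmbr1 inet static
            address 198.51.100.10/24
            bridge-ports enp1s0f0
            bridge-stp off
            bridge-fd 0

        # 10 GbE, unbonded: Ceph backside network + secondary ring on VLAN 13
        auto vmbr2
        iface vmbr2 inet manual
            bridge-ports enp1s0f1
            bridge-stp off
            bridge-fd 0
            bridge-vlan-aware yes
            bridge-vids 2-4094

        auto vmbr2.13
        iface vmbr2.13 inet static
            address 203.0.113.10/24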
  15. pvestatd[15200]: status update time

    Not to resurrect an old thread, but I'm seeing these errors as well as InfluxDB read timeouts, even though `time pvesm status` always returns in less than a second:
    ❯ time pvesm status
    Name        Type     Status     Total     Used     Available     %
    cluster...
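    In case anyone lands here with the same symptom, this is how I'd check whether one specific storage is the slow one (`--storage` is a standard pvesm option):

        # Time each configured storage individually to see if any single one
        # is what pushes pvestatd past its update window
        for s in $(pvesm status | awk 'NR>1 {print $1}'); do
            printf '== %s\n' "$s"
            time pvesm status --storage "$s" > /dev/null
        done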