So the node got fenced again yesterday and now I'm not seeing errors anymore.
There doesn't seem to be anything more in `daemon.log`. It includes the usual flapping followed by what looks like startup logs. Monitoring corosync stats might be a good idea. I'll try and get that worked in there to...
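For anyone curious, a rough sketch of what I have in mind for that (the path and interval are just placeholders):
# snapshot the knet per-link stats once a minute so there's history to look at after the next flap
❯ while true; do date; corosync-cmapctl -m stats | grep -E 'knet.*link'; sleep 60; done >> /root/corosync-link-stats.log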
Lucky for me, I have that info. From one of the other nodes:
2023-03-31T12:50:05-04:00 service 'vm:112': state changed from 'fence' to 'recovery'
2023-03-31T12:50:05-04:00 service 'vm:109': state changed from 'fence' to 'recovery'
2023-03-31T12:50:05-04:00 node 'zorya': state changed...
There are 0 errors on all nodes... There are retries across all nodes, but no node is an outlier. In the logs, it always seems to be link 1 that's failing, never link 0. Looking at the logs some more, every node claims that "host 5 joined", while host 5 claims every other node joined. Since this node...
The interfaces I have for my corosync rings are not bonded. This just started happening again today and I'm not sure what changed. I don't see any errors so far on the links, but I'm seeing constant flapping.
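A quick way to put a number on the flapping, in case it's useful (the grep pattern matches what the knet messages look like in my journal; adjust as needed):
❯ journalctl -u corosync --since today | grep -c 'link: 1 is down'   # how many times link 1 dropped today
❯ corosync-cfgtool -s                                                # current status of both links on this node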
If this is related to MAC address aging, why would it have such a weird pattern...
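If it were aging, it would be the switch's MAC table doing it, but for completeness this is roughly how I'd check the Linux bridge side (the bridge name is mine; adjust for yours):
❯ ip -d link show dev vmbr2 | grep -o 'ageing_time [0-9]*'   # bridge ageing time, in centiseconds (30000 = 5 minutes by default)
❯ bridge fdb show br vmbr2 | head                            # MAC entries the bridge currently holds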
I'm having some performance issues on CephFS that I'm trying to track down. I tried enabling stats following the information at https://docs.ceph.com/en/quincy/cephfs/cephfs-top/. The second I do `ceph mgr module enable stats`, all of my MDSes start to puke the following:
2023-02-19T21:19:31.164-0500...
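For reference, the sequence from those docs is roughly this, along with the way to back it out while the MDSes are unhappy (the client name is just what cephfs-top expects by default):
❯ ceph mgr module enable stats
❯ ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring
❯ cephfs-top
# and to back it out:
❯ ceph mgr module disable stats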
Sorry, I've been on a trip for the past week or so. Anyway, thanks for the tip; I've disabled RSTP and LLDP on the ports connected to the cluster nodes. We'll see if that helps.
Thanks for all of your help! I do have VMs running on vmbr1; those consume Ceph client traffic. Not ideal, but as the graph shows, they're not saturated in this situation, and the issue is happening on vmbr2.13. vmbr2 is shared between the Ceph backside network and the secondary ring. That...
The 10 GbE interfaces are not bonded; only the 1 GbE interfaces are bonded, and those are used for "public" VM traffic, not for Proxmox clustering or Ceph. They're using LACP, and the switch reports that the bond is fine.
Here's an example of one host's `/etc/network/interfaces`:
❯...
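The shape of it is roughly this (interface names, the VLAN tag, and addresses below are placeholders, not my real ones):
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    # bonded 1 GbE pair, "public" VM traffic only

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

auto vmbr2
iface vmbr2 inet static
    address 10.0.2.11/24
    bridge-ports enp65s0f1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    # unbonded 10 GbE: Ceph backside network

auto vmbr2.13
iface vmbr2.13 inet static
    address 10.0.13.11/24
    # VLAN on the same 10 GbE bridge, carrying the secondary corosync ring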
Not to resurrect an old thread, but I'm seeing these errors as well as InfluxDB read timeouts, even though `time pvesm status` always returns in less than a second:
❯ time pvesm status
Name Type Status Total Used Available %
cluster...
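For context, the InfluxDB target is the standard Proxmox external metric server, configured in /etc/pve/status.cfg; a typical entry looks roughly like this (the name, server, and port here are placeholders):
❯ cat /etc/pve/status.cfg
influxdb: metrics
        server 10.0.0.50
        port 8089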
Every so often (like right now), I'll start seeing a lot of KNET logs about a link going down and coming back up. Sometimes rebooting one node will fix it, sometimes it won't. It seems to happen randomly after node reboots or some other event. How can I determine which node is causing this or...
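For the record, the obvious things I know to check (run on each node; output differs a bit between corosync versions):
❯ corosync-cfgtool -s                                          # this node's view of link 0 and link 1
❯ corosync-cfgtool -n                                          # per-peer view of which links are up (newer corosync)
❯ journalctl -u corosync --since "1 hour ago" | grep 'link:'   # which hosts/links this node saw go down, and when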
In preparation for upgrading to Ceph Quincy (from Pacific), I had to remove and re-add three of my monitors as they were still using leveldb. Doing this through the GUI didn't work as it was somehow unable to find monitors with that ID. These are fairly old monitors so I ended up having to do...
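The rough shape of the manual route, for anyone who finds this later (the mon ID is a placeholder, and check quorum before destroying anything):
❯ ceph mon stat                    # make sure the remaining mons will keep quorum
❯ pveceph mon destroy <monid>      # drop the old leveldb monitor
❯ pveceph mon create               # re-create it on the same node, now on rocksdb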
Update: So one of two things made this go away:
1. Rebooting all the nodes.
2. Migrating a VM off of one of the nodes.
When doing the live migration of the VM, it was going absurdly slowly, like 56k over a 10 GbE link with no other traffic. After migrating that VM and rebooting, all is well...
Right, I did think that Ceph was just saturating the link, but after the rebalancing stopped and I started seeing very little traffic on the 10 GbE link... that's when I got suspicious. I'll do some digging on how to read that stats output, but if you could help me interpret it, that would be helpful...
We had a power outage yesterday. For unrelated reasons, 4 of the 5 nodes did not power up successfully. As I powered them up, I had to fix each node, which added maybe 5 minutes to each node's bootup. After this, I started seeing these messages constantly on all nodes, but _only_ for the secondary...
So this is now solved, I think. According to my error log, it seems like, somehow, an MDS was reporting a version older than 16.2.7. Checking the CephFS section in the Proxmox web UI, I actually had 2 MDSes that were reporting _no_ version string at all. I first did
ceph mds fail...
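For anyone hitting the same thing, the mismatch also shows up from the CLI, not just the web UI:
❯ ceph versions     # per-daemon version counts; an MDS on an odd version stands out here
❯ ceph fs status    # which daemons hold the active and standby MDS ranks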
Yup, I can confirm. Tried to turn it off this morning and I still get assertion failures. I wonder if I can delete and re-create each mon one at a time...