Hi,
On Saturday I decided to upgrade my Proxmox cluster from 7.1 to 7.3-3. That was probably a bad idea. It crashed 5 hours later at 5pm local time, and has been crashing every 24 hours around the same time since. The first crashes were at almost exactly the same time [5:02pm], but today it was 5:10pm.
This is a 4-node cluster running on enterprise hardware [Quanta] with 10Gb networking and all-flash Ceph storage.
Syslog messages are basically non-existent, though right when the problem starts it seems to begin with

pvestatd - got timeout

A couple of times I've gotten emails about node failures and fencing, and a couple of times I've seen watchdog messages, but nothing consistent across all nodes or failures. I have also noticed that doing almost anything with Ceph triggers it [I tried upgrading Ceph to 17 and it rebooted the whole cluster, and I stopped an OSD today, which also caused a cluster-wide reboot]. I see some Ceph slow ops messages for a while after the restart, but eventually everything settles down until the next day.
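In case it helps anyone point me in the right direction, this is roughly the sort of thing I've been checking after each crash (these are the standard PVE/Ceph commands; the time window below is just an example, not the exact crash time):

journalctl --since "16:50" --until "17:20" -u pvestatd -u pve-cluster -u corosync -u watchdog-mux
ceph -s
ceph health detail
ceph osd tree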
This cluster had been operating fine for a couple of years right up until this upgrade. I've tried running an older kernel [5.13], but that did not fix the issue; I've also confirmed that IPMI_Watchdog is working correctly, and even set up a second corosync ring on another network to try to troubleshoot this.
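For reference, the second ring was added roughly like this in /etc/pve/corosync.conf (node names and addresses here are placeholders rather than my real ones, and config_version gets bumped when saving):

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
    ring1_addr: 10.1.0.1
  }
  (same pattern for the other three nodes)
}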
This sounds like networking to me, but I cannot see how - the network is fine and stable under light load. The system is backed up every evening, which puts a much more significant strain on the network, and it experiences no such issue; the switches report no issues either, other than ports going up and down as expected during the reboots.
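When I say the network looks fine, that judgement is mostly based on basic checks along the lines of

corosync-cfgtool -s
pvecm status

between the crashes, plus the error counters on the switch ports.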
I am at a complete loss here, and might have to rebuild back to 7.1 if I cannot get this fixed. I'd be grateful for any insight or troubleshooting tips anyone can offer.
Thank you!