Proxmox 7.3-3 crashing every 24 hours after upgrade from 7.1

pinball_newf

New Member
Dec 15, 2022
Hi,
On Saturday I decided to upgrade my Proxmox cluster from 7.1 to 7.3-3. That was probably a bad idea. It crashed 5 hours later at 5pm local time, and is now crashing every 24 hours around the same time. At first it was almost exactly the same time [5:02pm], but today it was 5:10pm.

This is a 4-node cluster, running on enterprise hardware [Quanta] with 10Gb networking and all-flash Ceph storage.

Syslog messages are basically non-existent, though the first sign seems to be 'pvestatd - got timeout' right when the problem starts. A couple of times I've gotten emails about node failures and fencing, and a couple of times I've seen watchdog messages, but nothing consistent across all nodes or failures.
I have noticed that doing almost anything with Ceph causes it to happen [I tried upgrading Ceph to 17 and it rebooted the whole cluster, and I stopped an OSD today which also caused a cluster-wide reboot]. I also see some Ceph slow ops messages for a while after the restart, but eventually everything settles down until the next day.
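
For anyone wanting to dig through the same window, something like this should pull the relevant entries around the crash time on each node (the times and the list of units are just my guess at what matters here, adjust to your setup):

# cluster/HA services around the crash window (adjust --since/--until to your crash time)
journalctl --since "16:50" --until "17:20" -u pvestatd -u corosync -u pve-ha-lrm -u watchdog-mux

# kernel ring buffer for the same window (disk and NIC resets show up here)
journalctl -k --since "16:50" --until "17:20"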

This cluster was operating fine for a couple of years, right up until this upgrade. I've tried running an older kernel [5.13], but that did not fix the issue; I've ensured IPMI_Watchdog was working correctly, and even set up a second corosync ring on another network to try to troubleshoot this.
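
In case anyone wants to try the same kernel rollback, pinning a kernel on 7.x is just the following (the version string below is only an example, use one from your own list):

# list installed kernels, then pin the one to boot from now on
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.13.19-6-pve   # example version, pick one from the list output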
This sounds like networking to me, but I cannot see how: the network is stable and carries little load. The system is backed up every evening, which puts far more strain on the network, and experiences no such issue. The switches report no problems either, other than ports going up and down as expected during the reboots.

I am at a complete loss to fix this, and might have to rebuild back to 7.1 if I cannot get it sorted. I'd be grateful for any insight or troubleshooting tips anyone can offer.
Thank you!
 
Did you also test the newer 5.19 or 6.1 kernel?
Any cronjobs or systemd timers triggering at 5 o'clock?
I haven't tested the newer kernels on the basis that rolling back to 5.13 didn't fix the issue, but I'll give it a go.
Good call on the systemd timers or cronjobs, but nothing in either.
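For the record, something like this on each node covers both:

# list every systemd timer, active or not
systemctl list-timers --all

# and the usual cron locations
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly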
 
Kernel 6.1 did not fix this issue. Watching the journal, I saw an ATA error right before a node went down. Hopefully rebuilding to 7.1 fixes this.
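If the journal is persistent, the previous boot's kernel log shows the same error after the node comes back up, something like:

# kernel messages from the boot before the fence/reboot, filtered for ATA errors
journalctl -k -b -1 | grep -iE 'ata[0-9]|reset'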
 
Following up on this in case someone comes along in the future with the same problem.
The problem turned out to be pve/ceph executing smartctl -x on the boot drives. The boot drives in this system are Innodisk 64 GB SATADOM drives. When the smartctl command runs, the drive resets, causing a slowdown that then becomes a failure as corosync and Ceph see timeouts and begin to fence off the node. All 4 nodes do this at the same time and kill the cluster.
I don't know why this started happening after the 7.1 to 7.3 upgrade; reverting to a clean 7.2 install did not fix it, and rolling the 7.2 cluster back to 7.1 didn't either [don't do this, it's bad!].
It is possible the servers rebooting triggered this, or perhaps these checks were added at a later date. I'm not really sure, but I am able to replicate it by executing the command manually.
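
To see it for yourself, watch the kernel log in one shell and run the same command pve/ceph runs in another (substitute your own boot device for /dev/sdX):

# shell 1: follow the kernel log and watch for the ATA reset
journalctl -kf

# shell 2: query the boot SATADOM the same way pve/ceph does
smartctl -x /dev/sdX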
I hope this helps someone in the future!
 
