Proxmox 7.3-3 crashing every 24 hours after upgrade from 7.1

pinball_newf

New Member
Dec 15, 2022
Hi,
On Saturday I decided to upgrade my Proxmox cluster from 7.1 to 7.3-3. That was probably a bad idea. It crashed 5 hours later at 5pm local time, and is now crashing every 24 hours around the same time. At first it was almost exactly the same time [5:02pm], but today it was 5:10pm.

This is a 4-node cluster, running on enterprise hardware [Quanta] with 10Gb networking and all-flash Ceph storage.

Syslog messages are basically non-existent, though the first sign seems to be 'pvestatd - got timeout' right when the problem starts. A couple of times I've gotten emails about node failures and fencing, and a couple of times I've seen watchdog messages, but nothing consistent across all nodes or failures.
I have noticed that doing almost anything with Ceph causes it to happen [I tried upgrading Ceph to 17 and it rebooted the whole cluster, and I stopped an OSD today which also caused a cluster-wide reboot]. I also see some Ceph slow ops messages for a while after the restart, but eventually everything settles down until the next day.
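
For anyone wanting to dig through the same window, something like this should pull the relevant entries around the crash time on each node (the times and the list of units are just my guess at what matters here, adjust to your setup):

# cluster/HA services around the crash window (adjust --since/--until to your crash time)
journalctl --since "16:50" --until "17:20" -u pvestatd -u corosync -u pve-ha-lrm -u watchdog-mux

# kernel ring buffer for the same window (disk and NIC resets show up here)
journalctl -k --since "16:50" --until "17:20"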

This cluster was operating fine for a couple of years, right up until this upgrade. I've tried running an older kernel [5.13], but that did not fix the issue; I've ensured IPMI_Watchdog was working correctly, and even set up a second corosync ring on another network to try to troubleshoot this.
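
In case anyone wants to try the same kernel rollback, pinning a kernel on 7.x is just the following (the version string below is only an example, use one from your own list):

# list installed kernels, then pin the one to boot from now on
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.13.19-6-pve   # example version, pick one from the list output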
This sounds like networking to me, but I cannot see how: the network is stable and carries little load. The system is backed up every evening, which puts far more strain on the network, and experiences no such issue. The switches report no problems either, other than ports going up and down as expected during the reboots.

I am at a complete loss to fix this, and might have to rebuild back to 7.1 if I cannot get it sorted. I'd be grateful for any insight or troubleshooting tips anyone can offer.
Thank you!
 
Did you also test the newer 5.19 or 6.1 kernel?
Any cronjobs or systemd timers triggering at 5 o'clock?
I haven't tested the newer kernels on the basis that rolling back to 5.13 didn't fix the issue, but I'll give it a go.
Good call on the systemd timers or cronjobs, but nothing in either.
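For the record, something like this on each node covers both:

# list every systemd timer, active or not
systemctl list-timers --all

# and the usual cron locations
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly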
 
Kernel 6.1 did not fix this issue. Watching the journal, I saw an ATA error right before a node went down. Hopefully rebuilding to 7.1 fixes this.
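If the journal is persistent, the previous boot's kernel log shows the same error after the node comes back up, something like:

# kernel messages from the boot before the fence/reboot, filtered for ATA errors
journalctl -k -b -1 | grep -iE 'ata[0-9]|reset'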
 
Following up on this in case someone comes along in the future with the same problem.
The problem turned out to be pve/ceph executing smartctl -x on the boot drives. The boot drives in this system are Innodisk 64 GB SATADOM drives. When the smartctl command runs, the drive resets, causing a slowdown that then becomes a failure as corosync and Ceph see timeouts and begin to fence off the node. All 4 nodes do this at the same time and kill the cluster.
I don't know why this started happening after the 7.1 to 7.3 upgrade; reverting to a clean 7.2 install did not fix it, and rolling the 7.2 cluster back to 7.1 didn't either [don't do this, it's bad!].
It is possible the servers rebooting triggered this, or perhaps these checks were added at a later date. I'm not really sure, but I am able to replicate it by executing the command manually.
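
To see it for yourself, watch the kernel log in one shell and run the same command pve/ceph runs in another (substitute your own boot device for /dev/sdX):

# shell 1: follow the kernel log and watch for the ATA reset
journalctl -kf

# shell 2: query the boot SATADOM the same way pve/ceph does
smartctl -x /dev/sdX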
I hope this helps someone in the future!
 
