PVE 5.4 - Nodes suddenly reboot - no entries in logs

haukenx

Hi forum,

we have a 4-node setup based on Supermicro Superservers with the latest PVE 5.4 and Ceph. For about two months now, we have observed sudden reboots of single nodes, roughly once per week, without any hints in the logs (messages, syslog, kern.log). A node seems to run without any trouble and then suddenly performs a hard reboot. No kernel panic, no watchdog message - nothing.

Can someone give us a hint on how to debug this issue?

Hauke
 
Hi,
can you exclude hardware problems? Is it always the same node, or does it happen on different/random nodes?
 
Hi Chris,

I am confident that it is not primarily a hardware problem. All nodes are affected, randomly. Of course, it could still be hardware-related, since all nodes have the same hardware (except for the HDDs/SSDs on node 4).

Kind regards,
Hauke
 
Without any hints in the logs, this is indeed hard to debug. What about a firmware issue? Is the BIOS up to date on each node? Is there any load when this happens? Maybe the PSU?
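One generic way to catch messages that never make it to local disk: ship the logs to a second machine in real time, since a hard reset can lose the last buffered writes. A minimal rsyslog sketch (the target IP is a placeholder, not from this thread):

```
# /etc/rsyslog.d/10-remote.conf -- forward everything to another box so the
# final messages before a hard reset survive (10.0.0.50 is a placeholder)
*.*  @10.0.0.50:514    # single @ = UDP; use @@ for TCP
```

After `systemctl restart rsyslog`, watch the receiving host around the time of the next reboot.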
 
Hi Chris,

good point. We updated the BIOS a few weeks ago precisely because of this issue. We do not observe much CPU load (the machines are somewhat oversized CPU-wise). At the time of the last failure, CPU load was below 10%.

We did/do have, however, heavy traffic during most of the sudden reboots (though not during all of them), due to VM snapshotting and backups, both of which take several hours. Most of the heavy traffic goes through a 10 Gbit NIC (Intel 82599ES) towards Ceph, while some traffic also hits the "frontend" 1 Gbit NIC (Intel I210).

Once in a while, we see messages similar to this in kern.log:
NETDEV WATCHDOG: eno2 (igb): transmit queue 0 timed out
On another occasion, the 10G device also ran into a TX timeout and reset itself directly before the hard reset. So we decided to install irqbalance, which seemed to help a little. But obviously this was not the root cause, since the reboots still occur.
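If the box is actually panicking but never gets the trace to disk, a crash kernel would show it. A sketch for Debian-based PVE (the reserved size is a placeholder; tune it for the machine):

```
# /etc/default/grub -- reserve RAM for a crash kernel, so a panic is
# captured by kdump-tools into /var/crash instead of vanishing
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"

# afterwards:
#   apt install kdump-tools
#   update-grub && reboot
#   kdump-config show    # verify the crash kernel is loaded
```

If the reboots leave no dump even with kdump armed, that points away from a kernel panic and towards a hardware or watchdog reset.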

Kind regards,
Hauke
 
I had that problem on one server; it was a thermal problem.

Is there any BIOS CPU temperature overheat protection?
Does the Intel 82599ES have good thermal airflow?
Check the temperature monitor in the BIOS, and touch the Intel 82599ES heatsink while it is working...
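A simple way to correlate reboots with temperature over time (a sketch; assumes lm-sensors and ipmitool are installed, and sensor names vary per board):

```
# /etc/cron.d/temp-log -- append BMC temperature readings every minute, so
# the last reading before a reset is on disk (or better, on a remote log)
* * * * * root { date; ipmitool sdr type Temperature; } >> /var/log/temps.log 2>&1
```

The BMC event log (`ipmitool sel elist`) is also worth checking after each reset; many boards record thermal and power events there even when the OS logs show nothing.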

good luck
 
Hi,

thanks for the hint. The CPU temperature does not seem to be the problem; it remains stable below 45 °C at all times. The overheat protection would only trigger at 89 °C.

As for the airflow/temperature of the NIC, I will have to check with our hardware provider; the machines are in another datacenter that I do not have direct access to. But I'll check that.

Kind regards,
Hauke
 
Are you using a dedicated corosync network? You say that the problem usually happens when doing backups.

Regards,

Manu
 
Hi Manu,

thanks for your answer. No, we do not use a dedicated corosync network. But all backup/snapshot traffic goes through a dedicated storage network, so there is no high load on the network that corosync uses. I can see that in the Grafana instance we use to monitor the cluster. Backup traffic only hits the 10G device (creating load of up to 1.5 Gbit/s); the internal network is (almost) traffic-free during those times.

But just to be sure: where would I have a chance to see if corosync runs out of sync, other than in the logfiles (where, sadly, there is nothing to see)?

Kind regards,
Hauke
 
Hi,

Maybe you can check the HA status to see if the nodes keep quorum during the backup. If I'm not wrong, a node that gets out of sync simply reboots (if you use HA).
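For reference, the dedicated corosync network mentioned earlier would look roughly like this in corosync 2.x syntax; on PVE the file is /etc/pve/corosync.conf, and all addresses below are placeholders:

```
# /etc/pve/corosync.conf (excerpt) -- a ring on a dedicated subnet keeps
# cluster heartbeats away from the busy backup/storage NICs
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated cluster network
    ring1_addr: 10.10.20.1   # optional redundant ring
  }
  # ... the remaining nodes follow the same pattern ...
}
```

Quorum can be checked live with `pvecm status` or `corosync-quorumtool -s`; a node that loses quorum while running HA resources is fenced by the watchdog, which produces exactly this kind of sudden, log-free reboot.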

You can also try to reduce the network load during the backups using the bwlimit option in /etc/vzdump.conf.
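The bwlimit value is given in KiB/s. For example, to cap backups at roughly 100 MB/s (a placeholder number, leaving headroom on a 10G storage link):

```
# /etc/vzdump.conf -- limit backup I/O bandwidth; the unit is KiB/s,
# so 100000 is roughly 100 MB/s (pick a value that suits your links)
bwlimit: 100000
```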

Hope it helps.

Regards,

Manu
 
Short update: we upgraded to PVE 6 (hoping for a newer version of the ixgbe driver) and limited the vzdump bandwidth. The latter seems to be a temporary fix, but of course we will only see over the next weeks whether it actually works.

Kind regards,
Hauke
 

Do you boot from ZFS on the servers that reboot without any warning?
Also, what is your software and hardware configuration (processor, memory, disks, filesystem, swap, etc.)?
 
