PVE 5.4 - Nodes suddenly reboot - no entries in logs

haukenx

Hi forum,

we have a 4-node setup based on Supermicro Superservers with the latest PVE 5.4 and Ceph. For about two months now, we have observed sudden reboots of single nodes, roughly once per week, without any hints in the logs (messages, syslog, kern.log). A node seems to run without any trouble and then suddenly performs a hard reboot. No kernel panic, no watchdog message - nothing.

Can someone give us a hint on how to debug this issue?

Hauke
 
Hi,
can you exclude hardware problems? Is it always the same node, or does it happen on different/random nodes?
 
Hi Chris,

I am confident that it is not primarily a hardware problem. All nodes are affected, randomly. Of course, it could still be hardware-related, since all nodes have the same hardware (except for the HDDs/SSDs on node 4).

Kind regards,
Hauke
 
Without any hints in the logs, this is indeed hard to debug. What about a firmware issue? Is the BIOS up to date on each node? Is there any load when this happens? Maybe the PSU?
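One generic way to catch messages that never make it to local disk: ship the logs to a second machine in real time, since a hard reset can lose the last buffered writes. A minimal rsyslog sketch (the target IP is a placeholder, not from this thread):

```
# /etc/rsyslog.d/10-remote.conf -- forward everything to another box so the
# final messages before a hard reset survive (10.0.0.50 is a placeholder)
*.*  @10.0.0.50:514    # single @ = UDP; use @@ for TCP
```

After `systemctl restart rsyslog`, watch the receiving host around the time of the next reboot.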
 
Hi Chris,

good point. We updated the BIOS a few weeks ago precisely because of this issue. We do not observe much CPU load (the machines are somewhat oversized CPU-wise). At the time of the last failure, CPU load was below 10%.

We did/do have, however, heavy traffic during most of the sudden reboots (though not during all of them), due to VM snapshotting and backups, both of which take several hours. Most of the heavy traffic goes through a 10 Gbit NIC (Intel 82599ES) towards Ceph, while some traffic also hits the "frontend" 1 Gbit NIC (Intel I210).

Once in a while, we see messages similar to this in kern.log:
NETDEV WATCHDOG: eno2 (igb): transmit queue 0 timed out
On another occasion, the 10G device also ran into a TX timeout and reset itself directly before the hard reset. So we decided to install irqbalance, which seemed to help a little. But obviously this was not the root cause, since the reboots still occur.
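If the box is actually panicking but never gets the trace to disk, a crash kernel would show it. A sketch for Debian-based PVE (the reserved size is a placeholder; tune it for the machine):

```
# /etc/default/grub -- reserve RAM for a crash kernel, so a panic is
# captured by kdump-tools into /var/crash instead of vanishing
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"

# afterwards:
#   apt install kdump-tools
#   update-grub && reboot
#   kdump-config show    # verify the crash kernel is loaded
```

If the reboots leave no dump even with kdump armed, that points away from a kernel panic and towards a hardware or watchdog reset.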

Kind regards,
Hauke
 
I had that problem on one server; it was a thermal problem.

Is there any BIOS CPU temperature overheat protection?
Does the Intel 82599ES have good thermal airflow?
Check the temperature monitor in the BIOS, and touch the Intel 82599ES heatsink while it is working...
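A simple way to correlate reboots with temperature over time (a sketch; assumes lm-sensors and ipmitool are installed, and sensor names vary per board):

```
# /etc/cron.d/temp-log -- append BMC temperature readings every minute, so
# the last reading before a reset is on disk (or better, on a remote log)
* * * * * root { date; ipmitool sdr type Temperature; } >> /var/log/temps.log 2>&1
```

The BMC event log (`ipmitool sel elist`) is also worth checking after each reset; many boards record thermal and power events there even when the OS logs show nothing.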

good luck
 
Hi,

thanks for the hint. The CPU temperature does not seem to be the problem; it remains stable below 45 °C at all times. The overheat protection would only trigger at 89 °C.

As for the airflow/temperature of the NIC, I will have to check with our hardware provider; the machines are in another datacenter that I do not have direct access to. But I'll check that.

Kind regards,
Hauke
 
Are you using a dedicated corosync network? You say that the problem usually happens when doing backups.

Regards,

Manu
 
Hi Manu,

thanks for your answer. No, we do not use a dedicated corosync network. But all backup/snapshot traffic goes through a dedicated storage network, so there is no high load on the network that corosync uses. I can see that in the Grafana instance we use to monitor the cluster. Backup traffic only hits the 10G device (creating load of up to 1.5 Gbit/s); the internal network is (almost) traffic-free during those times.

But just to be sure: where would I have a chance to see if corosync runs out of sync, other than in the logfiles (where, sadly, there is nothing to see)?

Kind regards,
Hauke
 
Hi,

Maybe you can check the HA status to see if the nodes keep quorum during the backup. If I'm not wrong, a node that gets out of sync simply reboots (if you use HA).
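For reference, the dedicated corosync network mentioned earlier would look roughly like this in corosync 2.x syntax; on PVE the file is /etc/pve/corosync.conf, and all addresses below are placeholders:

```
# /etc/pve/corosync.conf (excerpt) -- a ring on a dedicated subnet keeps
# cluster heartbeats away from the busy backup/storage NICs
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated cluster network
    ring1_addr: 10.10.20.1   # optional redundant ring
  }
  # ... the remaining nodes follow the same pattern ...
}
```

Quorum can be checked live with `pvecm status` or `corosync-quorumtool -s`; a node that loses quorum while running HA resources is fenced by the watchdog, which produces exactly this kind of sudden, log-free reboot.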

You can also try to reduce the network load during the backups using the bwlimit option in /etc/vzdump.conf.
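The bwlimit value is given in KiB/s. For example, to cap backups at roughly 100 MB/s (a placeholder number, leaving headroom on a 10G storage link):

```
# /etc/vzdump.conf -- limit backup I/O bandwidth; the unit is KiB/s,
# so 100000 is roughly 100 MB/s (pick a value that suits your links)
bwlimit: 100000
```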

Hope it helps.

Regards,

Manu
 
Short update: we upgraded to PVE 6 (hoping for a newer version of the ixgbe driver) and limited the vzdump bandwidth. The latter seems to be a temporary fix, but of course we will only see over the next weeks whether it actually works.

Kind regards,
Hauke
 

Do you boot from ZFS on the servers that reboot without any warning?
Also, what is your software and hardware configuration (processor, memory, disks, filesystem, swap, etc.)?
 
