Proxmox: Server reboots randomly

nash.burns · May 7, 2018

Hi Guys,

We are noticing random server reboots and there is nothing found in the logs.

The following are the only Proxmox errors in the logs before the reboot:
-----
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:24 cav-cwh-node3 pmxcfs[4095]: [status] crit: cpg_send_message failed: 9
Apr 22 05:21:25 cav-cwh-node3 snmpd[3932]: error on subcontainer 'ia_addr' insert (-1)
Apr 22 05:40:29 cav-cwh-node3 kernel: imklog 5.8.11, log source = /proc/kmsg started.
-----

Could someone shed some light on this, or has anybody experienced a similar scenario?

I found a similar thread discussing this, but there was nothing helpful there either.

https://forum.proxmox.com/threads/cpg_send_message-failed-9.13423/

Xahid · May 7, 2018

You didn't post enough information for the developers to work with.
My guess: if you have created a cluster, it's not working well, and that is causing the reboots.

nash.burns · May 8, 2018

Hi,

The cluster is working fine. However, the server reboots randomly (not very frequently, about once a month or so). It doesn't look like an OS issue, since nothing OS-related shows up in the logs. Are there any specific details you require?

adamb · May 8, 2018

nash.burns said:
Hi,

The cluster is working fine. However, the server reboots randomly (not very frequently, about once a month or so). It doesn't look like an OS issue, since nothing OS-related shows up in the logs. Are there any specific details you require?

Typically it's either a cluster network issue or a hardware issue. The chances of it being an OS issue are pretty slim. Do you have other nodes with the same hardware that have the issue as well?

GadgetPig · May 9, 2018

I also think it may be a hardware issue, or possibly your UPS battery needs replacing. Several years back I had an old PowerEdge T300 server rebooting under Proxmox. The first thing I did was replace the power supply (and the CPU/case fans for good measure), and it's been stable since. Another time a different server was rebooting, and through troubleshooting (connecting the server to a non-UPS-backed outlet) I found that the UPS battery was going bad. It's been stable since I replaced the UPS battery. You can also try testing the server's memory with memtest for a few hours. Bad memory will usually show up within an hour of memtest.

Out of curiosity, is this node connected to an NFS server for storage? Any anomalies on the storage server? Are you using ZFS on Linux for local storage? And do you have at least 50% of memory allocated to ZFS?

nash.burns · May 11, 2018

adamb said:
Typically it's either a cluster network issue or a hardware issue. The chances of it being an OS issue are pretty slim. Do you have other nodes with the same hardware that have the issue as well?

Nope, I don't see any issue on the other node yet.

nash.burns · May 11, 2018

GadgetPig said:
I also think it may be a hardware issue, or possibly your UPS battery needs replacing. Several years back I had an old PowerEdge T300 server rebooting under Proxmox. The first thing I did was replace the power supply (and the CPU/case fans for good measure), and it's been stable since. Another time a different server was rebooting, and through troubleshooting (connecting the server to a non-UPS-backed outlet) I found that the UPS battery was going bad. It's been stable since I replaced the UPS battery. You can also try testing the server's memory with memtest for a few hours. Bad memory will usually show up within an hour of memtest.

Out of curiosity, is this node connected to an NFS server for storage? Any anomalies on the storage server? Are you using ZFS on Linux for local storage? And do you have at least 50% of memory allocated to ZFS?

I have confirmed that it is not a hardware issue. There is also plenty of RAM available for ZFS. Could it be something with the Proxmox version? The current version is 3.4-6/102d4547.

gkovacs · May 11, 2018

nash.burns said:
We are noticing random server reboots and there is nothing found in the logs.

Are you on ZFS? Is your system low on free memory? If your answer to both questions is yes, then your spontaneous reboot can be prevented by the following two steps:

1. ZFS ARC size
We aggressively limit the ZFS ARC size, as it has led to several spontaneous reboots in the past when left unlimited. Basically, we add up all the memory the system uses excluding caches and buffers (such as the combined maximum RAM of all KVM guests), subtract that from the total host RAM, and set the ARC maximum to a bit less than the remainder, so it only has to compete with the system cache. For example: on a 32 GB server where the maximum RAM allocation of the KVM guests is 25 GB, we set the ARC to max out at 5 GB (leaving 2 GB for everything else). We also set a lower limit of 1 GB for the ARC, as this has been reported to help performance.
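
If you want to adapt the numbers to your own host, the arithmetic above can be sketched in a few lines of shell. The variable names and the gib_to_bytes helper are made up for illustration, the 32/25/2 figures are just the example from this post, and it assumes a shell with 64-bit arithmetic (bash or dash on a 64-bit host).

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS or Proxmox): convert GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Example figures from the post; replace them with your own host's values.
total_gib=32     # total host RAM
guest_gib=25     # combined maximum RAM of all KVM guests
reserve_gib=2    # headroom left for everything else

# What remains after guests and headroom becomes the ARC maximum.
arc_max_gib=$(( total_gib - guest_gib - reserve_gib ))
arc_min_gib=1    # lower limit used in the post

# Print the module options in the form /etc/modprobe.d/zfs.conf expects.
echo "options zfs zfs_arc_max=$(gib_to_bytes "$arc_max_gib")"
echo "options zfs zfs_arc_min=$(gib_to_bytes "$arc_min_gib")"
```

With the example figures this gives an ARC maximum of 5 GiB and a minimum of 1 GiB, expressed in bytes because that is the unit the module options take.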

To do that, you have to add the following lines to /etc/modprobe.d/zfs.conf:

Code:

options zfs zfs_arc_max=5368709120
options zfs zfs_arc_min=1073741824

and after that run:

Code:

# update-initramfs -u

and reboot.
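
After the reboot, one way to confirm that the limits were picked up is to read the values the loaded module exposes under /sys/module/zfs/parameters. This is only a sketch: the show_arc_limits function name is made up, and the optional directory argument exists only so the snippet can be tried on a machine without ZFS loaded.

```shell
#!/bin/sh
# Hypothetical helper: print the ARC limits the running zfs module reports.
# The optional first argument overrides the parameter directory (for testing).
show_arc_limits() {
    params_dir="${1:-/sys/module/zfs/parameters}"
    for p in zfs_arc_max zfs_arc_min; do
        if [ -r "$params_dir/$p" ]; then
            # Values are reported in bytes, same as in zfs.conf.
            printf '%s=%s\n' "$p" "$(cat "$params_dir/$p")"
        else
            printf '%s: not readable (is the zfs module loaded?)\n' "$p"
        fi
    done
}

show_arc_limits
```

If the numbers printed match what you put in zfs.conf, the options were applied at module load time.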

Looking at the ARC of this very server with arc_summary.py you can see it stays between the limits:

Code:

ARC Size:                               30.72%  1.54    GiB
       Target Size: (Adaptive)         30.72%  1.54    GiB
       Min Size (Hard Limit):          20.00%  1.00    GiB
       Max Size (High Water):          5:1     5.00    GiB

ARC Size Breakdown:
       Recently Used Cache Size:       35.27%  554.85  MiB
       Frequently Used Cache Size:     64.73%  1018.10 MiB

2. SWAP on ZFS zvol
You also have to make sure that swap behaves well if it resides on a ZFS zvol (the default Proxmox installation places it there). Most important is disabling ARC caching of the swap volume, but the other tweaks are important as well (and endorsed by the ZFS on Linux community):
https://github.com/zfsonlinux/zfs/wiki/FAQ

Execute these commands in your shell (left out the # so you can copy all lines at once):

Code:

zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=zle rpool/swap
zfs set checksum=off rpool/swap
zfs set sync=always rpool/swap
zfs set logbias=throughput rpool/swap

You can verify these settings by running:

Code:

# zfs get all rpool/swap
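
Since zfs get all prints a long list, here is a rough sketch that checks only the six properties set above. The check_swap_props function is made up for illustration; it reads tab-separated "property value" lines on stdin, which is what zfs get prints with the -H and -o property,value options, and it assumes the swap zvol is named rpool/swap as in this post.

```shell
#!/bin/sh
# Expected values, copied from the "zfs set" commands in this post.
expected="primarycache=metadata
secondarycache=metadata
compression=zle
checksum=off
sync=always
logbias=throughput"

# Hypothetical helper: compare "property<TAB>value" lines on stdin
# against the expected list; returns non-zero if anything differs.
check_swap_props() {
    status=0
    actual=$(cat)
    for pair in $expected; do
        prop=${pair%%=*}
        want=${pair#*=}
        got=$(printf '%s\n' "$actual" | awk -F '\t' -v p="$prop" '$1 == p { print $2 }')
        if [ "$got" = "$want" ]; then
            echo "ok   $prop=$got"
        else
            echo "DIFF $prop: want $want, got ${got:-<missing>}"
            status=1
        fi
    done
    return $status
}

# Usage on the Proxmox host (run as root):
#   zfs get -H -o property,value \
#     primarycache,secondarycache,compression,checksum,sync,logbias \
#     rpool/swap | check_swap_props
```

Any line starting with DIFF points at a property that still has a different value.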
