Proxmox VE 6.2 Randomly Rebooting

silverstone

Active Member
Apr 28, 2018
For some weird reason, my recently installed "new" NAS is randomly rebooting (approximately 1-2 times per hour).

System:
- Supermicro X10SLL-F
- Intel Xeon E3-1270 V3
- 4 x 8GB Unbuffered ECC DDR3
- 2 x Crucial MX100 256GB
- 1 x IBM Exp ServeRAID M1015 (LSI 9220-8i) -> PCI-e pass-through to NAS VM
- 1 x MELLANOX CONNECTX-2 EN 10GBE

IPMI/BMC event log: no warnings or errors

Proxmox Installation
Community repository with the latest updates applied

root@pve72:~# pveversion -v
Code:
proxmox-ve: 6.2-1 (running kernel: 5.3.18-3-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

The issue appears on:
- pve-kernel-5.4.41-1-pve

Limit ZFS memory usage
According to https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage, I explicitly tried to limit ZFS memory usage to 4GB of the total 32GB:
root@pve72:~# cat /etc/modprobe.d/zfs.conf
Code:
# Limit RAM usage to 4GB maximum
options zfs zfs_arc_max=4294967296
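
For completeness, a sketch of how the limit can be applied at runtime and made persistent (since the root pool is on ZFS, the wiki notes the initramfs has to be regenerated so the option is picked up at boot):
Code:
# apply the new limit immediately at runtime
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# regenerate the initramfs so the option from /etc/modprobe.d/zfs.conf is used at boot
update-initramfs -u -k all
# verify the current value
cat /sys/module/zfs/parameters/zfs_arc_max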

Reduce Swappiness
For some reason, this is ignored: after each reboot, sysctl vm.swappiness returns 60 (the default).

root@pve72:~# cat /etc/sysctl.d/swappiness.conf
Code:
# Reduce swappiness to avoid high IO load
vm.swappiness = 10

Putting the same in /etc/sysctl.conf yields the same result (setting is ignored).
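
A few checks that might show why the setting is ignored (a sketch, assuming systemd-sysctl is what applies the files at boot):
Code:
# re-apply all sysctl files and print which files are read, in which order
sysctl --system
# look for another file that sets swappiness and might override ours
grep -r swappiness /etc/sysctl.conf /etc/sysctl.d/ /usr/lib/sysctl.d/ /run/sysctl.d/ 2>/dev/null
# check whether the service that applies them ran cleanly at boot
systemctl status systemd-sysctl.service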

VMs:
- A single Gentoo Linux 5.4.x VM (NAS) with 4 vCPUs and 12GB of dedicated RAM

CPU temperatures:
Kind of hot (~70°C-75°C) during a ZFS snapshot transfer from the old NAS to the new NAS. I tried increasing the fan PWM duty cycle to 100% and opened the chassis to provide some more ventilation for the CPU.
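
To keep an eye on the temperatures during the transfer, something like this can be used (assuming lm-sensors and ipmitool are installed; exact sensor names are board-specific):
Code:
# core temperatures as reported by the CPU
sensors | grep -i core
# temperatures as seen by the BMC
ipmitool sdr type temperature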

Current investigation
I'm currently testing the older pve-kernel-5.3.18-3-pve to see whether the issue also appears with that kernel.
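
For reference, a sketch of how the 5.3 kernel can be made the default boot entry on a GRUB-based install (the exact menu entry name has to be taken from /boot/grub/grub.cfg; a ZFS/UEFI install booting via systemd-boot works differently):
Code:
# find the exact menu entry name of the 5.3 kernel
grep "menuentry '" /boot/grub/grub.cfg | grep 5.3.18-3-pve
# set GRUB_DEFAULT to that entry, e.g.
# GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.3.18-3-pve"
nano /etc/default/grub
update-grub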

I have several such systems (Xeon E3 v3) and, until now, have not had problems with them. The current NAS is running on an E5 v2, though.


For reference, here is the hardware and software configuration of my existing virtualized NAS host:
- Supermicro X9SRL-F
- 1 x Intel Xeon E5-2680 v2
- 8 x 32GB ECC Registered DDR3

root@pve04:/tools_nfs/Proxmox# pveversion -v
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

I see different versions of QEMU, Proxmox VE, the kernel, as well as ZFS.

Additional information
On the affected host, with the 5.4.x kernel, I see the following line in /var/log/messages just before a new system boot:
Code:
vfio-pci 0000:05:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
This is the IBM M1015 / LSI SAS2008 controller that is passed through to the NAS VM.
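
In case the failed ROM read itself matters, one workaround worth trying is to stop QEMU from exposing the option ROM of the passed-through device, e.g. with the rombar=0 flag in the VM config (VMID 100 is just a placeholder here):
Code:
# /etc/pve/qemu-server/100.conf (excerpt)
hostpci0: 05:00.0,rombar=0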
 

Attachments

  • messages.txt (457.7 KB)

The vfio-pci message happened approx. 15 minutes before the reboot; what is suspicious is rather the following:

Code:
May 23 04:49:11 pve72 salt-minion[1394]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.


Verify whether the issue remains or disappears when deactivating SaltStack.
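
For example (a sketch, assuming the stock systemd unit name):
Code:
systemctl disable --now salt-minion.service
# or, to make sure nothing re-enables it during the test window:
systemctl mask salt-minion.service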
 
Thank you for your reply, Richard.
The system runs fine with a kernel from the 5.3.x series; it has now been up for 5 days and is still going strong.
Something is definitely wrong with the kernels in the 5.4.x series.

Why would the issue be caused by SaltStack? Do you think the Salt minion is causing a kernel panic :oops: ? Did you have any previous experience or reports suggesting this?

I thought that the shutdown/restart sequence was being triggered by the vfio line:
Code:
vfio-pci 0000:05:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

And that, as a result, the salt service, the zed service, all network bridges/interfaces, etc. were shut down.

But yes, you are right, I missed the part about "15 minutes before reboot".
 
An application that logs a message shortly before a reboot is always suspicious. If the reboot also happens without it, you at least know that you can exclude it.

Another check: look at more such incidents in the syslog and see whether certain messages show up in each case.
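
For example, something like this to compare what was logged right before each reboot (a persistent journal is assumed; otherwise grep the rotated syslogs):
Code:
# list recorded boots
journalctl --list-boots
# show the end of the previous boot, i.e. what happened just before the reboot
journalctl -b -1 -e
# without a persistent journal, search the rotated syslogs instead
zgrep -i -e "salt-minion" -e "vfio-pci" /var/log/syslog*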
 
