[SOLVED] Server crashes since upgrade to 6.1-8

Fonta · New Member · Mar 29, 2020
Hi all,

I've upgraded PVE to 6.1-8 this week and now seem to be experiencing crashes.
Looking through the syslog, the messages log, and the other logs, I can't find anything in particular that might be causing this.
Is anyone able to help me? How should I investigate this further?

The server runs as a standalone node on an HPE MicroServer Gen10. I've already upgraded the BIOS to the latest version and also tested kernel 5.4.


Attached are some log files.

Thanks!

Fonta
 


I've upgraded PVE to 6.1-8 this week and now seem to be experiencing crashes.
What are the crash symptoms? Because, as you said, there's nothing really obvious in the logs at all. Are they from a time when such a crash happened?

One thing I found is:
Code:
Mar 29 06:40:20 pve kernel: [    0.510505] sysfs: cannot create duplicate filename '/firmware/acpi/tables/data/BERT'
Mar 29 06:40:20 pve kernel: [    0.510511] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.4.24-1-pve #1
Mar 29 06:40:20 pve kernel: [    0.510513] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 02/19/2020
Mar 29 06:40:20 pve kernel: [    0.510515] Call Trace:
Mar 29 06:40:20 pve kernel: [    0.510539]  dump_stack+0x6d/0x9a
Mar 29 06:40:20 pve kernel: [    0.510546]  sysfs_warn_dup.cold.5+0x17/0x32
Mar 29 06:40:20 pve kernel: [    0.510549]  sysfs_add_file_mode_ns+0x158/0x190
Mar 29 06:40:20 pve kernel: [    0.510553]  sysfs_create_bin_file+0x64/0x90
Mar 29 06:40:20 pve kernel: [    0.510559]  acpi_bert_data_init+0x37/0x50
Mar 29 06:40:20 pve kernel: [    0.510563]  acpi_sysfs_init+0x17b/0x23b
Mar 29 06:40:20 pve kernel: [    0.510567]  ? acpi_sleep_proc_init+0x2a/0x2a
Mar 29 06:40:20 pve kernel: [    0.510571]  ? do_early_param+0x95/0x95
Mar 29 06:40:20 pve kernel: [    0.510573]  acpi_init+0x170/0x31a
Mar 29 06:40:20 pve kernel: [    0.510578]  do_one_initcall+0x4a/0x1fa
Mar 29 06:40:20 pve kernel: [    0.510582]  ? do_early_param+0x95/0x95
Mar 29 06:40:20 pve kernel: [    0.510585]  kernel_init_freeable+0x1b8/0x25d
Mar 29 06:40:20 pve kernel: [    0.510590]  ? rest_init+0xb0/0xb0
Mar 29 06:40:20 pve kernel: [    0.510592]  kernel_init+0xe/0x100
Mar 29 06:40:20 pve kernel: [    0.510596]  ret_from_fork+0x22/0x40
Mar 29 06:40:20 pve kernel: [    0.512157] ACPI: Interpreter enabled
Mar 29 06:40:20 pve kernel: [    0.512200] ACPI: (supports S0 S5)
Mar 29 06:40:20 pve kernel: [    0.512203] ACPI: Using IOAPIC for interrupt routing
Mar 29 06:40:20 pve kernel: [    0.512275] [Firmware Bug]: HEST: Table contents overflow for hardware error source: 2.
Mar 29 06:40:20 pve kernel: [    0.512279] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
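
That sysfs warning looks like the kernel registering the BERT (Boot Error Record Table) data file twice; it's noisy, but not necessarily related to your crashes. If you want to dig into it, you could check what the firmware actually exposes there. A rough sketch (the sysfs path is the one from your log):
Code:
# list the ACPI data tables the kernel registered
ls -l /sys/firmware/acpi/tables/data/
# collect all firmware error reporting messages from this boot
dmesg | grep -iE 'BERT|HEST|hardware error'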
 
The logs show a couple of crashes. In syslog.txt, for example, you can see that all is well and then it suddenly starts logging boot messages.
E.g.:
Code:
Mar 29 06:37:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 06:37:01 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 06:37:01 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 06:38:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 06:38:01 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 06:38:01 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'iscsi_tcp'
Mar 29 06:40:20 pve dmeventd[464]: dmeventd ready for processing.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'ib_iser'
Mar 29 06:40:20 pve lvm[464]: Monitoring thin pool pve-data-tpool.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'vhost_net'
Mar 29 06:40:20 pve systemd[1]: Starting Flush Journal to Persistent Storage...
Mar 29 06:40:20 pve systemd[1]: Started udev Coldplug all Devices.

The crashes do seem to happen in the morning, when Plex (running on an Nvidia Shield) performs its daily maintenance on a network share in one of the VMs.

Not sure what I can do with that duplicate filename message.
 
The logs show a couple of crashes. In syslog.txt, for example, you can see that all is well and then it suddenly starts logging boot messages.
E.g.:
That is not a crash, though? What in the quoted log indicates the crash?
 
Well, I'm simply assuming there's a crash, because why else would the logs suddenly show messages from the system booting?
The uptime of the PVE host also resets.
At 06:40:20 it starts booting.
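
For the record, the wtmp history is a quick way to cross-check this (plain last from util-linux, nothing PVE-specific); a reboot record without a clean shutdown entry right before it points at a crash or power loss:
Code:
# reboot/shutdown history, newest first
last -x shutdown reboot | head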
 
Two days ago I did some firmware updates, which helped the server by lowering resource usage.
As you can see in the screenshot, after updating the firmware for the NIC and the HD controller (during the gap in the graph), resource usage no longer climbs daily.
On the first day after the firmware update (31-3), the scheduled tasks in Plex were disabled and the server kept running through the morning.
Yesterday evening I re-enabled the tasks, and the server rebooted this morning.

The Plex server runs on an Nvidia Shield, and the library is on a share inside a VM running Ubuntu.
Somehow this does something to the host that makes it suddenly reboot, without any warning in the logs.
You only see it booting again.

[screenshot: resource usage graph]
I also kept cat /dev/kmsg running from another machine during the night. This morning that session was disconnected, but unfortunately it didn't print any message indicating the host was about to reboot.
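
For anyone trying the same, making the journal persistent should at least preserve everything that was logged up to the moment the host died; this is standard journald behaviour, nothing PVE-specific:
Code:
# keep the journal on disk instead of in RAM
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next incident, read the previous boot's log
journalctl -b -1 -e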

This wasn't happening on PVE 5.4, so I'm really stuck here.
Can I downgrade?
 
Hi,
do you use ZFS on this server?
If yes: have you set zfs_arc_min and zfs_arc_max?
I had reboots (a while ago) with a low-memory server under heavy ZFS I/O.
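
You can check what is currently in effect via the module parameters; on ZFS-on-Linux they live under /sys/module/zfs/parameters, and 0 means the built-in default (the ARC may then grow to roughly half of RAM):
Code:
grep . /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max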

Udo
I'm indeed using ZFS. I don't think I've set those values.
Where should I set them?
 
Hi,
in this case 6 GB (perhaps too much for a 16 GB server):
Code:
cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=6442450944
#
# after changes it's important to run
# update-initramfs -u
# and reboot
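#
# to verify after the reboot, check the ARC kstats; c_min/c_max
# should then show 6442450944 (bytes):
# grep -E '^(size|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats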
Udo
 
Hi,
in this case 6 GB (perhaps too much for a 16 GB server):
Code:
cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=6442450944
#
# after changes it's important to run
# update-initramfs -u
# and reboot
Udo
First morning with the scheduled tasks enabled and it kept running all the way through :)
Let's hope this was the fix. I'll check again tomorrow.
 
New update: the server now has an uptime of more than 1.5 days! That means it survived two mornings, which is great.
[screenshot: uptime graph]

Limiting the memory ZFS can use might have solved it.
 
