[SOLVED] Server crashes since upgrade to 6.1-8

Fonta

New Member
Mar 29, 2020
Hi all,

I've upgraded PVE to 6.1-8 this week and now seem to be experiencing crashes.
Looking through the syslog, messages, and other logs, I can't find anything in particular that might be causing this.
Is anyone able to help me? How should I investigate this further?

The server runs as a standalone node on an HPE MicroServer Gen10. I've already upgraded the BIOS to the latest version and have also tested kernel 5.4.


Attached are some log files.

Thanks!

Fonta
 

Attachments

  • pveversion.txt (1.3 KB)
  • daemon.log (243 KB)
  • kern.log (283.5 KB)
  • messages.txt (267.4 KB)
  • syslog.txt (531.8 KB)
I've upgraded PVE to 6.1-8 this week and now seem to be experiencing crashes.
What are the crash symptoms? Because, as you said, there's nothing really obvious in the logs at all. Are they from a time when such a crash happened?

One thing I found is:
Code:
Mar 29 06:40:20 pve kernel: [    0.510505] sysfs: cannot create duplicate filename '/firmware/acpi/tables/data/BERT'
Mar 29 06:40:20 pve kernel: [    0.510511] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.4.24-1-pve #1
Mar 29 06:40:20 pve kernel: [    0.510513] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 02/19/2020
Mar 29 06:40:20 pve kernel: [    0.510515] Call Trace:
Mar 29 06:40:20 pve kernel: [    0.510539]  dump_stack+0x6d/0x9a
Mar 29 06:40:20 pve kernel: [    0.510546]  sysfs_warn_dup.cold.5+0x17/0x32
Mar 29 06:40:20 pve kernel: [    0.510549]  sysfs_add_file_mode_ns+0x158/0x190
Mar 29 06:40:20 pve kernel: [    0.510553]  sysfs_create_bin_file+0x64/0x90
Mar 29 06:40:20 pve kernel: [    0.510559]  acpi_bert_data_init+0x37/0x50
Mar 29 06:40:20 pve kernel: [    0.510563]  acpi_sysfs_init+0x17b/0x23b
Mar 29 06:40:20 pve kernel: [    0.510567]  ? acpi_sleep_proc_init+0x2a/0x2a
Mar 29 06:40:20 pve kernel: [    0.510571]  ? do_early_param+0x95/0x95
Mar 29 06:40:20 pve kernel: [    0.510573]  acpi_init+0x170/0x31a
Mar 29 06:40:20 pve kernel: [    0.510578]  do_one_initcall+0x4a/0x1fa
Mar 29 06:40:20 pve kernel: [    0.510582]  ? do_early_param+0x95/0x95
Mar 29 06:40:20 pve kernel: [    0.510585]  kernel_init_freeable+0x1b8/0x25d
Mar 29 06:40:20 pve kernel: [    0.510590]  ? rest_init+0xb0/0xb0
Mar 29 06:40:20 pve kernel: [    0.510592]  kernel_init+0xe/0x100
Mar 29 06:40:20 pve kernel: [    0.510596]  ret_from_fork+0x22/0x40
Mar 29 06:40:20 pve kernel: [    0.512157] ACPI: Interpreter enabled
Mar 29 06:40:20 pve kernel: [    0.512200] ACPI: (supports S0 S5)
Mar 29 06:40:20 pve kernel: [    0.512203] ACPI: Using IOAPIC for interrupt routing
Mar 29 06:40:20 pve kernel: [    0.512275] [Firmware Bug]: HEST: Table contents overflow for hardware error source: 2.
Mar 29 06:40:20 pve kernel: [    0.512279] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
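If persistent journaling is enabled on the node (i.e. /var/log/journal exists), it's also worth pulling the kernel messages from the boot that ended unexpectedly, e.g.:
Code:
# list the boots the journal knows about
journalctl --list-boots
# kernel log of the previous boot; the last lines before the cut-off are the interesting part
journalctl -k -b -1 | tail -n 100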
 
The logs show a couple of crashes. In syslog.txt, for instance, you can see that all is well and then it suddenly starts showing boot messages.
E.g.:
Code:
Mar 29 06:37:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 06:37:01 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 06:37:01 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 06:38:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 06:38:01 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 06:38:01 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'iscsi_tcp'
Mar 29 06:40:20 pve dmeventd[464]: dmeventd ready for processing.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'ib_iser'
Mar 29 06:40:20 pve lvm[464]: Monitoring thin pool pve-data-tpool.
Mar 29 06:40:20 pve systemd-modules-load[436]: Inserted module 'vhost_net'
Mar 29 06:40:20 pve systemd[1]: Starting Flush Journal to Persistent Storage...
Mar 29 06:40:20 pve systemd[1]: Started udev Coldplug all Devices.

The crashes do seem to happen in the morning, when Plex (running on a Shield) performs its daily maintenance on a network share in one of the VMs.

Not sure what I can do with that duplicate filename message.
 
The logs show a couple of crashes. In syslog.txt, for instance, you can see that all is well and then it suddenly starts showing boot messages.
E.g.:
That is not a crash? What in the quoted log indicates the crash?
 
Well, I'm simply assuming there's a crash, because why else would the logs suddenly show messages from the system booting?
The uptime of PVE also resets.
At 06:40:20 it starts booting.
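The reboots also show up in the login records, for example (wtmp-based, so it works even when nothing made it into syslog):
Code:
# every (re)boot recorded in wtmp, with full timestamps
last -F reboot | head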
 
Two days ago I did some firmware updates, which helped the server by lowering resource usage.
As you can see in the screenshot, after updating the firmware for the NIC and HD controller (during the cut-out), the resource usage no longer climbs daily.
On the first day after updating the firmware (on 31-3), the scheduled tasks in Plex were disabled and the server kept running in the morning.
Yesterday evening I re-enabled the tasks, and the server was rebooted again this morning.

The Plex server is running on an Nvidia Shield and the library is on a share inside a VM running Ubuntu.
Somehow this does something to the host that makes it suddenly reboot without any warning in the logs.
You only see it booting.

[Screenshot attached: 1585747465505.png]
I also kept cat /dev/kmsg running from another machine during the night; this morning the session was disconnected, but unfortunately it didn't show any message indicating it was going to reboot.
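Next time I'll probably try netconsole instead of an SSH session, so the last kernel messages are sent over UDP even if the session dies. Something along these lines (the IPs, MAC and interface name below are just placeholders for my setup):
Code:
# on the PVE host: stream kernel messages over UDP to another machine
# 6665 = local port, eno1 = local NIC, 192.168.1.10 = this host, 192.168.1.50/6666/MAC = the listener (all placeholders)
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.50/aa:bb:cc:dd:ee:ff
# on the listening machine: capture the stream (or 'nc -u -l -p 6666' for traditional netcat)
nc -u -l 6666 | tee netconsole.log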

This wasn't happening on PVE 5.4, so I'm really stuck here.
Can I downgrade?
 
Hi,
do you use ZFS on this server?
If yes, have you set zfs_arc_min + zfs_arc_max?
I had reboots (a while ago) with a low-memory server under heavy ZFS I/O.

Udo
I'm indeed using ZFS. I don't think I've set those values.
Where should I set them?
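In the meantime, this is roughly how I'm checking what the ARC currently uses (ZFS on Linux exposes the counters under /proc/spl/kstat):
Code:
# current ARC size and its min/max targets, in bytes
grep -E '^(size|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats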
 
Hi,
in this case, for 6 GB (perhaps too much for a 16 GB server):
Code:
cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=6442450944
#
# after changes it's important to run
# update-initramfs -u
# and reboot
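After the reboot you can double-check that the limits were actually applied:
Code:
# both should print 6442450944 once the new initramfs has booted
cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max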
Udo
 
Hi,
in this case, for 6 GB (perhaps too much for a 16 GB server):
Code:
cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=6442450944
#
# after changes it's important to run
# update-initramfs -u
# and reboot
Udo
First morning with the scheduled tasks enabled and it kept running all the way through :)
Let's hope this was the fix. I'll check again tomorrow.
 
New update: the server now has an uptime of more than 1.5 days! Therefore I can say it survived two mornings, which is great.
[Screenshot attached: 1585916872805.png]

Limiting the memory on ZFS might have solved it.
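I'll keep an eye on the ARC during the Plex maintenance window for a while, roughly like this (arcstat ships with zfsutils on PVE, as far as I know):
Code:
# print ARC size and hit-rate statistics every 10 seconds
arcstat 10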
 
