PBS is locking up

kransom

Member
Aug 22, 2023
My Proxmox Backup Server is locking up frequently, leaving no SSH or console access. Only a hard reboot restores communication to the server. I believe it is a kernel panic or some sort of hardware failure. I tried to view the logs, but it seems they aren't captured at the time of failure. I also tried setting up iDRAC on the server, but it is not logging these events either. I am wondering whether my setup was prone to failing from the beginning and would appreciate any insight on it.

My server is a Dell PowerEdge R330. My external RAID controller is a Nexsan E48. The controller is connected via two fibre channel cards and configured using multipath. After configuring a 42-disk RAID 6 array on the controller, a partition map was created on the multipath device using gdisk. Then an ext4 file system was created on that partition. A directory was also created to serve as the mount point for the datastore. In the PBS GUI, a datastore was created with the backing path /mnt/ext_raid.
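For reference, the steps above roughly correspond to the following commands. The multipath device name /dev/mapper/mpatha is an assumption (check yours with multipath -ll), and the last line is the CLI equivalent of the GUI step:

Code:
# Identify the multipath device first
multipath -ll

# Create a GPT partition map on the multipath device (interactive)
gdisk /dev/mapper/mpatha

# ext4 file system on the new partition
mkfs.ext4 /dev/mapper/mpatha-part1

# Mount point and mount for the datastore
mkdir -p /mnt/ext_raid
mount /dev/mapper/mpatha-part1 /mnt/ext_raid

# Same as creating the datastore in the PBS GUI
proxmox-backup-manager datastore create ext_raid /mnt/ext_raid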

I have 5 PVE clusters (multiple VMs and containers) and a few standalone hosts backing up to this PBS server. The server never failed like this before I added the Nexsan, so I assume that is what's causing it to lock up. PBS itself was installed on ZFS.

proxmox-backup-manager versions
proxmox-backup-server 4.1.4-1 running version: 4.1.4
 
Hi, @kransom
If you have or can have a monitor (I mean a physical display) connected to the server, there may be some errors displayed on it when for any reason the system isn't already able to log anything to the files.
 
pbs-error.jpg

Let me know if I need to provide any other information. I tried looking this up before and it looks like a kernel panic.
 
Quite possible. There may be more info above the visible area. Sometimes you can scroll back a few screens with Shift+PgUp on the keyboard connected to the server (unless the display is also completely locked).

There do exist ways of finding the reason for a panic from these messages, but I can't recall them off the top of my head, I'm sorry.
The exact method should be searchable, though.
 
you can try the pstore interface, a serial console, or a netconsole to get the full log..
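A minimal netconsole sketch, assuming the PBS host's NIC is eno1 at 192.168.1.40 and a listener at 192.168.1.50 (adjust port, IPs, interface, and MAC to your network):

Code:
# On the PBS server: stream kernel messages over UDP to another machine.
# Format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
modprobe netconsole netconsole=6665@192.168.1.40/eno1,6666@192.168.1.50/aa:bb:cc:dd:ee:ff

# On the receiving machine: capture everything the kernel emits,
# including panic output that never reaches the local disk.
nc -u -l 6666 | tee pbs-kernel.log

# After a crash and reboot, also check pstore for saved panic records:
ls /sys/fs/pstore/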
 
That is definitely a stressful situation, especially when your backups are involved. Since it's locking up entirely, it might be worth checking the syslog or journalctl logs specifically for any IO wait spikes or "out of memory" (OOM) errors right before the freeze.
If you’re running PBS on a VM, double-check that you aren't over-provisioning the RAM, as it can be pretty memory-intensive during GC tasks. Also, if you're using ZFS, sometimes a failing drive or a saturated controller can cause the whole kernel to hang.
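To check for those signs after the next reboot, something like the following works, assuming persistent journaling is enabled (Storage=persistent in /etc/systemd/journald.conf):

Code:
# Jump to the end of the previous boot's journal, where a hang
# usually leaves its last trace
journalctl -b -1 -e

# Search recent boots for OOM-killer activity and block-layer errors
journalctl --since "-7 days" | grep -Ei "out of memory|oom-kill|blk_update_request|I/O error"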
 
Your BIOS is 9 years old. I would upgrade it first, and disable Intel RAPL for testing:

Code:
echo "blacklist intel_rapl_msr" > /etc/modprobe.d/disable-intel-rapl.conf
echo "blacklist intel_rapl_common" >> /etc/modprobe.d/disable-intel-rapl.conf
update-initramfs -u
reboot
 
I was never able to recover full logs from any of the failures. I also updated the BIOS, but that did not help. I gave up on trying to save that system and rebuilt it. I used XFS instead of ext4 for the file system on the partition created for the Nexsan drive chassis, and it seems to be doing better. The new system has been up for 2 weeks with no failures so far.
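For anyone following along, the rebuild only changes the file-system step from the original setup. The multipath device name is an assumption, as before:

Code:
# XFS instead of ext4 on the partition created earlier with gdisk
mkfs.xfs /dev/mapper/mpatha-part1
mount /dev/mapper/mpatha-part1 /mnt/ext_raid

# Persist the mount across reboots; XFS defaults are generally fine
# for a PBS datastore
echo '/dev/mapper/mpatha-part1 /mnt/ext_raid xfs defaults 0 0' >> /etc/fstab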
 
Good to hear it's stable now. The EXT4 to XFS switch likely made the difference for a reason worth documenting.

EXT4 serializes writes through its journaling in ways that can amplify I/O latency on large multipath LUNs. With a 42-disk RAID 6 array over fiber channel and multipath, any I/O stall that causes the driver to wait on one path can cascade - the journal flush blocks, memory reclaim blocks behind it, and if PBS garbage collection or ZFS (for the OS) is competing for I/O resources at the same moment, you can get a full hang that looks like a kernel panic from the outside.

XFS uses delayed logging, and its journal is considerably less aggressive about flushing in ways that create contention on large LUNs. On high-LUN-count multipath setups it tends to handle path failover and requeue events more gracefully, avoiding the cascading-wait scenario.

The intel_rapl blacklist suggestion is still worth keeping in place even on the rebuilt system - RAPL energy limiting on older Dells can cause unexpected CPU throttling during sustained I/O on a memory and disk intensive workload like PBS GC runs.
 