PBS is locking up

kransom

Member
Aug 22, 2023
My Proxmox Backup Server is locking up frequently, leaving no SSH or console access. Only a hard reboot restores communication to the server. I believe it is a kernel panic or some sort of hardware failure. I tried to view the logs, but it seems they aren't captured at the time of failure. I also tried setting up iDRAC on the server, but it is not logging these events either. I am wondering whether my setup was prone to failing from the beginning and would appreciate any insight on it.

My server is a Dell PowerEdge R330. My external RAID controller is a Nexsan E48. The controller is connected via two fibre channel cards and configured using multipath. After configuring a 42-disk RAID 6 array on the controller, a partition map was created on the multipath device using gdisk. Then an ext4 file system was created on that partition. A directory was also created to serve as the mount point for the datastore. In the PBS GUI, a datastore was created with the backing path /mnt/ext_raid.
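For reference, the steps above roughly correspond to the following commands. The multipath device name /dev/mapper/mpatha is an assumption (check yours with multipath -ll), and the last line is the CLI equivalent of the GUI step:

Code:
# Identify the multipath device first
multipath -ll

# Create a GPT partition map on the multipath device (interactive)
gdisk /dev/mapper/mpatha

# ext4 file system on the new partition
mkfs.ext4 /dev/mapper/mpatha-part1

# Mount point and mount for the datastore
mkdir -p /mnt/ext_raid
mount /dev/mapper/mpatha-part1 /mnt/ext_raid

# Same as creating the datastore in the PBS GUI
proxmox-backup-manager datastore create ext_raid /mnt/ext_raid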

I have 5 PVE clusters (multiple VMs and containers) and a few standalone hosts backing up to this PBS server. The server never failed like this before I added the Nexsan, so I assume that is what's causing it to lock up. PBS itself was installed on ZFS.

proxmox-backup-manager versions
proxmox-backup-server 4.1.4-1 running version: 4.1.4
 
Hi, @kransom
If you have or can have a monitor (I mean a physical display) connected to the server, there may be some errors displayed on it when for any reason the system isn't already able to log anything to the files.
 
pbs-error.jpg

Let me know if I need to provide any other information. I tried looking this up before and it looks like a kernel panic.
 
Quite possible. There may be more info above the visible area. Sometimes you can scroll back a few screens with Shift+PgUp on the keyboard connected to the server (unless the display is also completely locked).

There do exist ways of finding the reason for a panic from these messages, but I can't recall them off the top of my head, I'm sorry.
The exact method should be searchable, though.
 
you can try the pstore interface, a serial console, or a netconsole to get the full log..
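A minimal netconsole sketch, assuming the PBS host's NIC is eno1 at 192.168.1.40 and a listener at 192.168.1.50 (adjust port, IPs, interface, and MAC to your network):

Code:
# On the PBS server: stream kernel messages over UDP to another machine.
# Format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
modprobe netconsole netconsole=6665@192.168.1.40/eno1,6666@192.168.1.50/aa:bb:cc:dd:ee:ff

# On the receiving machine: capture everything the kernel emits,
# including panic output that never reaches the local disk.
nc -u -l 6666 | tee pbs-kernel.log

# After a crash and reboot, also check pstore for saved panic records:
ls /sys/fs/pstore/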
 
That is definitely a stressful situation, especially when your backups are involved. Since it's locking up entirely, it might be worth checking the syslog or journalctl logs specifically for any IO wait spikes or "out of memory" (OOM) errors right before the freeze.
If you’re running PBS on a VM, double-check that you aren't over-provisioning the RAM, as it can be pretty memory-intensive during GC tasks. Also, if you're using ZFS, sometimes a failing drive or a saturated controller can cause the whole kernel to hang.
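To check for those signs after the next reboot, something like the following works, assuming persistent journaling is enabled (Storage=persistent in /etc/systemd/journald.conf):

Code:
# Jump to the end of the previous boot's journal, where a hang
# usually leaves its last trace
journalctl -b -1 -e

# Search recent boots for OOM-killer activity and block-layer errors
journalctl --since "-7 days" | grep -Ei "out of memory|oom-kill|blk_update_request|I/O error"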
 
Your BIOS is 9 years old. I would upgrade it first, and disable Intel RAPL for testing:

Code:
echo "blacklist intel_rapl_msr" > /etc/modprobe.d/disable-intel-rapl.conf
echo "blacklist intel_rapl_common" >> /etc/modprobe.d/disable-intel-rapl.conf
update-initramfs -u
reboot
 
I was never able to recover full logs from any of the failures. I also updated the BIOS, but that did not help. I gave up on trying to save that system and rebuilt it. I used XFS instead of ext4 for the file system on the partition created for the Nexsan drive chassis, and it seems to be doing better. The new system has been up for 2 weeks with no failures so far.
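For anyone following along, the rebuild only changes the file-system step from the original setup. The multipath device name is an assumption, as before:

Code:
# XFS instead of ext4 on the partition created earlier with gdisk
mkfs.xfs /dev/mapper/mpatha-part1
mount /dev/mapper/mpatha-part1 /mnt/ext_raid

# Persist the mount across reboots; XFS defaults are generally fine
# for a PBS datastore
echo '/dev/mapper/mpatha-part1 /mnt/ext_raid xfs defaults 0 0' >> /etc/fstab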
 
Good to hear it's stable now. The EXT4 to XFS switch likely made the difference for a reason worth documenting.

EXT4 serializes writes through its journaling in ways that can amplify I/O latency on large multipath LUNs. With a 42-disk RAID 6 array over fiber channel and multipath, any I/O stall that causes the driver to wait on one path can cascade - the journal flush blocks, memory reclaim blocks behind it, and if PBS garbage collection or ZFS (for the OS) is competing for I/O resources at the same moment, you can get a full hang that looks like a kernel panic from the outside.

XFS uses delayed logging, and its journal is considerably less aggressive about flushing in ways that create contention on large LUNs. On high-LUN-count multipath setups it tends to handle path failover and requeue events more gracefully, avoiding the cascading-wait scenario.

The intel_rapl blacklist suggestion is still worth keeping in place even on the rebuilt system - RAPL energy limiting on older Dells can cause unexpected CPU throttling during sustained I/O on a memory and disk intensive workload like PBS GC runs.
 