PBS failure with kernel 7.0

Leopold31 · May 26, 2026

So we built a PBS on an old HP MicroServer Gen8. The datastore consists of four enterprise-grade Samsung SM863a SSDs in a ZFS RAIDZ1 configuration. The specs for this server are shown here:

We added a separate boot drive, also a Samsung SM863a, on the spare SATA connector.

Our PVE cluster is 3 HP Gen11 servers with 128GB RAM, 4 Samsung SM863a SSDs set up in ZFS RAID10 (striped mirrors) for the datastore, and a separate NVMe drive on a PCIe card as the boot disk. The specs of our PVE cluster are here:

The PBS and all PVE cluster hosts have Mellanox 10GbE NICs.

The issue is that the PBS crashes whenever we run a backup to it, and we have to force a reboot through iLO. We have isolated the issue to the 7.0 kernels: when we pin the kernel to any of the 6.17 kernels, the issue goes away. On kernel 7.0 the PBS seems fine as long as no backup is running, but it crashes on the first backup we run.

When we run a backup, it starts at normal speed (backup task logs below), then gets slower and slower, to the point where just pulling up a shell in the GUI is quite slow. Eventually we can't pull up a shell at all. Here we are trying to pull up a shell:

Find attached the task logs for both a kernel 6.17 backup and a kernel 7.0 backup. You can see exactly where things slow down significantly with the 7.0 backup.

Also find attached the journalctl errors on the PBS.

In general, it feels like either a memory leak or an SSD that is throttling because it is getting hot. The problem does not show with kernel 6.17, so I think that rules out an overheating SSD. And we typically don't have those issues with the Samsung SM863a anyway.

In looking through the notes on kernel 7, it appears that kernel 7.0.10 (allegedly) fixes a memory leak that occurs when a NIC is run at capacity. So this might be a problem that gets fixed once Proxmox releases a 7.0.10 kernel. Or not.

Also, we have another PBS that is on kernel 7 that does not have this issue. However, that PBS is on significantly better hardware so it is probably able to keep up with the data rates, while the old gen 8 server is not. Which makes it sound like the 7.0.10 kernel might be on the right track.

Chris · May 26, 2026

Hi,

Leopold31 said:
In looking through the notes on kernel 7, it appears that kernel 7.0.10 (allegedly) fixes a memory leak that occurs when a NIC is run at capacity. So this might be a problem that gets fixed once Proxmox releases a 7.0.10 kernel. Or not.

are you referring to commit [0]?

If so, as a first step you could monitor the output of grep skbuff_head_cache /proc/slabinfo, as stated in the initial report [1], while performing a backup on kernel 7.0 to see if your memory grows as described. This should tell you whether it is related.
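
The monitoring step above can be scripted so the numbers are logged over the course of a backup. The following is a minimal sketch, not an official tool: the function names, the 10-second interval and the byte estimate (num_objs × objsize) are illustrative assumptions. It assumes the usual /proc/slabinfo column order (name, active_objs, num_objs, objsize) and needs root to read the file.

```python
import time
from typing import Optional


def parse_slab_line(text: str, cache: str = "skbuff_head_cache") -> Optional[dict]:
    """Find one cache in /proc/slabinfo text and return its object counts.

    Assumes the slabinfo column order: name, active_objs, num_objs, objsize.
    Returns None if the cache name is not present.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == cache:
            active, total, objsize = (int(x) for x in fields[1:4])
            return {
                "active_objs": active,
                "num_objs": total,
                "objsize": objsize,
                # Rough estimate only: allocated objects times object size.
                "approx_bytes": total * objsize,
            }
    return None


def watch(interval_s: float = 10.0, path: str = "/proc/slabinfo") -> None:
    """Print a timestamped sample every interval_s seconds until interrupted."""
    while True:
        with open(path) as f:
            row = parse_slab_line(f.read())
        print(time.strftime("%H:%M:%S"), row)
        time.sleep(interval_s)
```

Calling watch() as root in a separate session while the backup runs, and comparing the first and last samples, should show whether active_objs keeps climbing instead of levelling off.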

Unfortunately using an Ubuntu mainline kernel is not an option to narrow down the issue, as you are using a datastore backed by ZFS.

[0] https://git.kernel.org/pub/scm/linu.../?id=a6bd339dbb3514bce690fdcf252e788dfab4ee76
[1] https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/

Edit: fixed reference to commit

Leopold31 · May 26, 2026

Chris said:
Hi,

are you referring to commit [0]?

If so, as a first step you could monitor the output of grep skbuff_head_cache /proc/slabinfo, as stated in the initial report [1], while performing a backup on kernel 7.0 to see if your memory grows as described. This should tell you whether it is related.

Unfortunately using an Ubuntu mainline kernel is not an option to narrow down the issue, as you are using a datastore backed by ZFS.

[0] https://git.kernel.org/pub/scm/linu.../?id=a6bd339dbb3514bce690fdcf252e788dfab4ee76
[1] https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/

I was referring to this report: https://www.linuxcompatible.org/story/linux-kernel-7010-released/

I have a heavy schedule the next few days, so I don't expect to have time for more testing right away. I'll see if I can get more information when I do. FWIW, the PBS GUI was not showing RAM usage at the limit (it was showing about 50%), so if there is a memory leak, the system was not reporting it.
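
For cross-checking a RAM figure like the one above against the kernel's own counters, /proc/meminfo can be read directly. This is a minimal sketch under assumptions: the helper names are hypothetical, how the PBS GUI computes its percentage is not known here, and whether a given leak shows up under SUnreclaim depends on where the kernel allocates the leaked memory.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field: integer value}.

    The fields used below (MemTotal, SUnreclaim) are reported in kB.
    Some other fields, such as HugePages_Total, are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def unreclaimable_slab_share(info: dict) -> float:
    """Fraction of total RAM held in slab memory the kernel cannot reclaim."""
    return info["SUnreclaim"] / info["MemTotal"]


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live counters (no root needed)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Sampling unreclaimable_slab_share(read_meminfo()) before and during a backup gives a second data point next to the GUI gauge.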

Chris · May 26, 2026

You might also want to test whether setting the kernel command line parameter libata.force=noncq has an effect, as the issues seem to appear after a DMA error.
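
After rebooting with the parameter, it is worth confirming that it actually reached the running kernel before drawing conclusions from the test. A minimal sketch, assuming the documented libata.force format of a comma-separated list of [ID:]VAL entries; the function names are hypothetical, and how the parameter is added depends on the bootloader and is not covered here.

```python
from typing import List


def libata_force_values(cmdline: str) -> List[str]:
    """Collect every entry given via libata.force= on a kernel command line."""
    values: List[str] = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            values.extend(token.split("=", 1)[1].split(","))
    return values


def ncq_disabled(cmdline: str) -> bool:
    """True if noncq is set globally or for at least one port/device.

    Entries may carry an ID prefix such as "1.00:noncq", so only the part
    after the last colon is compared.
    """
    return any(v.split(":")[-1] == "noncq" for v in libata_force_values(cmdline))


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On the PBS, ncq_disabled(running_cmdline()) should return True once the parameter is in effect.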

Leopold31 · May 26, 2026

Chris said:
You might also want to test whether setting the kernel command line parameter libata.force=noncq has an effect, as the issues seem to appear after a DMA error.

That shouldn't take much time. I'll see if I can do that tonight. In the meantime, my other PBS is on kernel 7 (seemingly without problems), as is my PVE cluster. I started becoming concerned about what would happen with a restore, so I restored the 650GB SQL server. It restored in 19 minutes. I disconnected networking, booted it, and it was fine. A good solid restore. Whew! Otherwise, I was going to have to pin everything back to 6.17.

Hopefully I can find time to do the quick test above tonight. No guarantees.

Leopold31 · Jun 2, 2026

OK, so I've been pretty busy and hadn't gotten back to this until yesterday. I was going through all my backup devices, doing test restores and making sure the restored VMs were bootable. And guess what? We had a restore failure on this PBS. We have plenty of backup devices, but a backup device that fails to restore changes the priority quite a bit.

So, I got into the weeds on this. I'm not saying that kernel 7 is not a problem, but our issue appears to be the Mellanox 10GbE NIC. I bypassed the 10GbE NIC and used the built-in Broadcom 1GbE NIC instead. This PBS now backs up, and it restores, and it does both on kernel 7.0.6-2-pve. So there is something about the combination of the Mellanox NIC and kernel 7 (and even the latest 6.17 kernel) that is not playing nice.

I've just left the Broadcom NIC active for now. The PBS is functional, which makes diagnosing the Mellanox NIC a much lower priority. Unfortunately, we had a 5-hour power outage yesterday during which only priority loads could run. I lost a lot of time and will be catching up all week, so I'm not sure when I'll get back to this, but I'm hoping Thursday.

Leopold31 · Jun 2, 2026

And ... it's dead. I found I couldn't log into the PBS. And it wasn't doing anything. It was just sitting there. Damn. Must be kernel 7. Let's pin it back to kernel 6.17. Rebooted. Selected the kernel in GRUB. GUI won't come up. Go back to the iLO remote terminal. It won't accept a login. Like it doesn't even recognize typing. OK, reboot. Selected the kernel in GRUB. Have the remote terminal in iLO. I can type in a login password. Filesystem is borked. Mounted as read-only. No GUI. Can barely do anything.

It's an old HP MicroServer Gen8. We're calling an audible. It's not worth resuscitating. Now to find another box to mount these disks in. So totally not happening right now. I think I can do it on Thursday. We'll see.
