PBS failure with kernel 7.0

Leopold31

New Member
Jul 23, 2024
So we built a PBS on an old HP MicroServer Gen8. The datastore consists of four enterprise-grade Samsung SM863a SSDs in a ZFS RAIDZ1 configuration. The specs for this server are shown here:

PBS Stats.png

We have added a separate boot drive on the spare SATA connector, also a Samsung SM863a.

Our PVE cluster consists of 3 HP Gen 11 hosts with 128GB RAM, 4 Samsung SM863a SSDs set up in RAIDZ10 for the datastore, and a separate NVMe drive on a PCI card as a boot disk. The specs of our PVE cluster are here:

PVE Cluster Stats.png

The PBS and all PVE cluster hosts have Mellanox 10GbE NICs.

So the issue we are having is that when we run a backup to the PBS, the PBS crashes. We have to do a forced reboot, which we do through iLO. We have isolated the issue to the 7.0 kernels. When we pin the kernel to any of the 6.17 kernels, the issue goes away. If we are not performing a backup with kernel 7, the PBS seems to be fine. But when we run our first backup, it crashes.
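For anyone wanting to reproduce the pinning step, here is a rough sketch of how it can be done with `proxmox-boot-tool`. The helper name `newest_in_series` is made up for this example, and it assumes that `proxmox-boot-tool kernel list` prints one kernel version per line between its heading lines; check the real output on your system before relying on it.

```shell
#!/bin/sh
# Sketch: pick the newest installed kernel of a given series so it can be pinned.

# Read text from stdin and print the highest version that starts with the
# given series prefix (for example "6.17"). Non-matching lines such as
# headings are ignored. Uses "sort -V" from GNU coreutils.
newest_in_series() {
    awk -v p="$1." 'index($0, p) == 1' | sort -V | tail -n 1
}

# Example usage on the PBS host (run as root):
#   ver=$(proxmox-boot-tool kernel list | newest_in_series 6.17)
#   proxmox-boot-tool kernel pin "$ver"
#   reboot
# To go back to the default kernel selection later:
#   proxmox-boot-tool kernel unpin
```

After the reboot, `uname -r` should show the pinned version.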

So what happens is when we run a backup, the backup starts at normal speed (backup status logs below). And then it gets slow. And it gets slower and slower. It gets to the point where just pulling up a shell in the GUI is quite slow. And eventually, we can't pull up a shell at all. Here we are trying to pull up a shell.

PBS very slow - cant pull up Shell.png

Find attached the task logs for both a kernel 6.17 backup and a kernel 7.0 backup. You can see exactly where things slow down significantly with the 7.0 backup.

Also find attached the journalctl errors on the PBS.

In general, it feels like either a memory leak or an SSD that is throttling because it is getting hot. The problem does not show with kernel 6.17, so I think that rules out an overheating SSD. And we typically don't have those issues with the Samsung SM863a anyway.

In looking through the notes on kernel 7, it appears that kernel 7.0.10 (allegedly) fixes a memory leak that occurs when a NIC is run at capacity. So this might be a problem that gets fixed once Proxmox releases a 7.0.10 kernel. Or not.

Also, we have another PBS that is on kernel 7 that does not have this issue. However, that PBS is on significantly better hardware so it is probably able to keep up with the data rates, while the old gen 8 server is not. Which makes it sound like the 7.0.10 kernel might be on the right track.
 


Hi,
In looking through the notes on kernel 7, it appears that kernel 7.0.10 (allegedly) fixes a memory leak that occurs when a NIC is run at capacity. So this might be a problem that gets fixed once Proxmox releases a 7.0.10 kernel. Or not.
are you referring to commit [0]?

If so, as a first step you could monitor the output of grep skbuff_head_cache /proc/slabinfo, as described in the initial report [1], while performing a backup on kernel 7.0 to see if your memory usage grows as described there. This should tell you whether it is related.
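A minimal sketch of how that monitoring could be scripted so the numbers are logged over the course of a backup. The function names `slab_usage` and `watch_slab` and the log path are invented for this example. It assumes the usual slabinfo column order (name, active objects, total objects, object size in bytes) and needs to run as root, since /proc/slabinfo is not readable by unprivileged users.

```shell
#!/bin/sh
# Sketch: log the size of one slab cache at a fixed interval.

# Read slabinfo-formatted text from stdin and print
# "<active_objs> <num_objs> <approx KiB>" for the named cache.
# The KiB figure is num_objs * objsize / 1024, truncated to an integer.
slab_usage() {
    awk -v name="$1" '$1 == name { printf "%d %d %d\n", $2, $3, ($3 * $4) / 1024 }'
}

# Print a timestamped sample every <interval> seconds until interrupted.
watch_slab() {
    cache="${1:-skbuff_head_cache}"
    interval="${2:-10}"
    while :; do
        printf '%s %s\n' "$(date '+%F %T')" "$(slab_usage "$cache" < /proc/slabinfo)"
        sleep "$interval"
    done
}

# Example usage (stop with Ctrl-C once the backup has finished or stalled):
#   watch_slab skbuff_head_cache 10 | tee /root/slab-kernel7.log
```

Running the same loop once on 6.17 and once on 7.0 gives two logs to compare. A cache that keeps climbing only on 7.0 would point towards the leak; flat numbers would suggest looking elsewhere.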

Unfortunately using an Ubuntu mainline kernel is not an option to narrow down the issue, as you are using a datastore backed by ZFS.

[0] https://git.kernel.org/pub/scm/linu.../?id=a6bd339dbb3514bce690fdcf252e788dfab4ee76
[1] https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/

Edit: fixed reference to commit
 
I was referring to this report: https://www.linuxcompatible.org/story/linux-kernel-7010-released/

I have a heavy schedule the next few days, so I don't expect to have time for more testing right away. I'll see if I can get more information when I do. FWIW, the PBS GUI was not indicating RAM usage at the limit (it was indicating about 50%), so if there is a memory leak, it was not visible to the system.
 
You might also want to test whether setting the kernel command line parameter libata.force=noncq has an effect, as the issues seem to appear after a DMA error.
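A rough sketch of how the parameter could be set and then verified. Which of the two methods in the comments applies depends on the bootloader, and the thread does not say which one this PBS uses. The function name `has_kernel_param` is made up for this example.

```shell
#!/bin/sh
# Sketch: confirm that libata.force=noncq is active after a reboot.
#
# Setting it (pick the variant matching your bootloader, then reboot):
#   GRUB:          add the option to GRUB_CMDLINE_LINUX_DEFAULT in
#                  /etc/default/grub, then run: update-grub
#   systemd-boot:  append the option to the single line in
#                  /etc/kernel/cmdline, then run: proxmox-boot-tool refresh

# Succeed if the given option appears as a whole word in the command line
# passed as the second argument (defaults to the running kernel's
# /proc/cmdline).
has_kernel_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage after the reboot:
#   has_kernel_param libata.force=noncq && echo "noncq is on the command line"
```

If the backup then completes on kernel 7.0, that would point at NCQ handling on the SATA side rather than the network path.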
 
That shouldn't take much time. I'll see if I can do that tonight. In the meantime, I have my other PBS on kernel 7 (seemingly without problems) as well as my PVE. I started becoming concerned about what would happen with a restore, so I restored the 650GB SQL server. It restored in 19 minutes. I disconnected networking and booted it, and it was fine. A good solid restore. Whew! Otherwise, I was going to have to pin everything back to 6.17.

Hopefully I can find time to do the quick test above tonight. No guarantees.