We built a PBS on an old HP MicroServer Gen8. The datastore is four enterprise-grade Samsung SM863a SSDs in a ZFS RAIDZ1 configuration. The specs for this server are shown here:

We added a separate boot drive on the spare SATA connector, also a Samsung SM863a.
Our PVE cluster is three HP Gen11 hosts with 128GB RAM, four Samsung SM863a SSDs set up as ZFS RAID10 (striped mirrors) for the datastore, and a separate NVMe drive on a PCIe card as the boot disk. The specs of the PVE cluster are here:

The PBS and all PVE cluster hosts have Mellanox 10GbE NICs.
The issue: when we run a backup to the PBS, the PBS crashes and we have to force a reboot through iLO. We have isolated the issue to the 7.0-series kernels. When we pin any of the 6.17 kernels, the issue goes away. On kernel 7 the PBS seems fine as long as no backup is running, but it crashes on the first backup we run.
When we run a backup, it starts at normal speed (backup task logs below) and then gets slower and slower. It gets to the point where just pulling up a shell in the GUI is quite slow, and eventually we can't pull up a shell at all. Here we are trying to pull up a shell:

Attached are the task logs for both a kernel 6.17 backup and a kernel 7.0 backup. You can see exactly where the 7.0 backup slows down significantly.
Also attached are the journalctl errors from the PBS.
In general, it feels like either a memory leak or an SSD that is throttling because it is getting hot. The problem does not show up with kernel 6.17 on the same hardware, so I think that rules out an overheating SSD, and we typically don't have those issues with the Samsung SM863a anyway.
Looking through the notes on kernel 7, it appears that kernel 7.0.10 (allegedly) fixes a memory leak that occurs when a NIC runs at capacity. So this might get fixed once Proxmox releases a 7.0.10 kernel. Or not.
We also have another PBS on kernel 7 that does not have this issue. However, that PBS is on significantly better hardware, so it is probably able to keep up with the data rates while the old Gen8 server is not. That makes it sound like the 7.0.10 fix might be the right track.
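To test the memory-leak theory rather than guess, one option is to log a few memory counters on the PBS while a backup runs and compare kernel 6.17 against 7.0. Below is a minimal sketch, not something we have run on this box. It assumes the standard Linux /proc/meminfo layout and the OpenZFS arcstats file at /proc/spl/kstat/zfs/arcstats; the function names, the choice of fields, and the 5-second interval are just illustrative choices.

```python
"""Sketch: sample memory counters during a backup to look for a leak."""
import time

# Fields that tend to be informative for kernel-side growth (an assumption,
# adjust as needed). All are standard /proc/meminfo keys, reported in kB.
FIELDS = ("MemAvailable", "Slab", "SUnreclaim", "Dirty", "Writeback")


def parse_meminfo(text):
    """Return {name: integer value} for lines shaped like 'Name:  123 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def parse_arc_size(text):
    """Return the ZFS ARC size in bytes from arcstats text, or None."""
    for line in text.splitlines():
        parts = line.split()
        # arcstats rows are 'name type data'; the ARC size row is 'size'.
        if len(parts) == 3 and parts[0] == "size" and parts[2].isdigit():
            return int(parts[2])
    return None


def format_sample(values, fields=FIELDS):
    """Render the chosen meminfo fields in MiB on one line."""
    return " ".join(f"{f}={values.get(f, 0) // 1024}MiB" for f in fields)


def watch(interval=5.0, samples=None):
    """Print one timestamped line per interval; samples=None runs forever."""
    count = 0
    while samples is None or count < samples:
        with open("/proc/meminfo") as fh:
            line = format_sample(parse_meminfo(fh.read()))
        try:
            with open("/proc/spl/kstat/zfs/arcstats") as fh:
                arc = parse_arc_size(fh.read())
        except OSError:
            arc = None  # ZFS module not loaded or path differs
        if arc is not None:
            line += f" ARC={arc // (1024 * 1024)}MiB"
        print(time.strftime("%H:%M:%S"), line, flush=True)
        count += 1
        time.sleep(interval)
```

Calling `watch()` from an SSH session on another machine (so the output survives when the PBS hangs) should show whether MemAvailable keeps falling while Slab or SUnreclaim keeps climbing on kernel 7.0 but stays flat on 6.17. A steadily growing SUnreclaim would point toward a kernel-side leak; a large ARC alone would not, since the ARC is expected to fill available memory.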

Attachments
- PBS1-journalctl-errors.log (68 KB)
- task-pve1-vzdump-kernel-6-17-successful-and-fast.log (23.2 KB)
- task-pve1-vzdump-kernel-7-slows-to-crawl-and-hangs-aborted.log (18.5 KB)
- task-pve3-kernel-7-slows-to-crawl-and-hangs-aborted.log (6.9 KB)
- task-pve3-vzdump-kernel-6-17-completed-and-fast.log (20.8 KB)