Proxmox Backup Server is unstable

I'm running PBS 3.0-1 in a VM under PVE 8.0.4. I'm trying to run a days-long backup, but after a few days, PBS crashes in the middle of the night. There are no messages in Syslog -- just an abrupt end of the messages produced by the backup followed by the messages produced by the bootup that I initiated.

I need some guidance documenting this so that I can report the problem on Bugzilla.

Thanks.
 
Are there any messages in the host's syslog? Are there any other tasks running concurrently (GC/verify/etc.)? What's the VM config?
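For reference, something along these lines would pull the relevant info (the time window and VM ID here are placeholders):
Code:
# host log around the crash window (adjust the timestamps)
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"
# dump the VM configuration as PVE sees it
qm config <VMID>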
 
Thank you so much for the quick reply.

I've attached a snippet of the host's syslog from the time the VM crashed. I don't know how to interpret it, but it seems that it might be a hardware error. (??)

Noting the SMART messages, the Disks display shows that all the disks pass SMART. The same goes for PBS, which has its disks passed through from PVE. Both are running ZFS, and the zpool statuses on both are clean.
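For what it's worth, the checks were roughly of this form (device and pool names are placeholders, not my actual ones):
Code:
# SMART health and attributes for one of the passed-through disks
smartctl -a /dev/sdX
# pool health on both PVE and PBS
zpool status -v <poolname>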
 

Attachments

  • PBS crash syslog on Leghorn.zip
    1.2 KB
mhmm... a segfault indicates some kind of corruption:

* faulty memory (I'd do a memtest)
* faulty disk -> corrupt files (check the disks via SMART, which you did already, and check your installed packages with the 'debsums' tool; you probably have to install it via apt first -- see the sketch after this list)
* faulty CPU (there's no really easy way to check this, besides replacing it)
* a bug in the program (with only a segfault it's hard to debug, and only worth it if you've mostly ruled out the underlying hardware)
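A minimal sketch of the package check, assuming a Debian-based install ('debsums -s' reports only files whose checksums don't match):
Code:
apt install debsums
# verify the on-disk files of all installed packages, printing only mismatches
debsums -s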
 
I ran memtest for a few hours, and it didn't find anything untoward. But, I've read that memtest doesn't really put enough load on the system to catch many problems.

Then I read that a segfault is just as likely to be caused by a programming error, so I decided to go down that road. First, I went back to PBS version 2. After putting a load on it for a couple of hours, it crashed again.

Then, I went back to PVE version 7. (Remember, I'm running PBS as a VM under PVE.) PBS has been running a heavy load for nearly a full day.

So, I'm going to mark this as solved. But I think Proxmox has a bit of work to do.
 
mhmm so it is not really solved...

if you could, would you mind trying to reproduce it once again and posting the output of 'pveversion -v' and the version of libc? e.g. with 'dpkg -l | grep libc6'

with that info we can try to see what exactly fails and track down where the bug is

EDIT: or you can install systemd-coredump and post the crashdump, or attach to the process with gdb and get a backtrace (if that's possible for you)
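a rough sketch of the coredump route, to be run on whichever system hosts the crashing process:
Code:
apt install systemd-coredump
# after the next crash, list recorded dumps and open the newest one in gdb
coredumpctl list
coredumpctl gdb
# inside gdb: 'bt' or 'thread apply all backtrace'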
 
The problem has evolved somewhat.

I think something might be going on with ZFS. Current situation: I backed up the PBS storage device that I'm having trouble with. I used ZFS send/recv to get the data into another pool that is managed by PVE, not PBS. Then I destroyed the PBS pool and rebuilt it in ZFS. Now I am trying to copy the data back using the same method. I can see the disks being busy, and htop shows lots of data being transferred. (I have a few ssh sessions going so that I can try to keep an eye on things.) After around 20 GB to 30 GB, PBS becomes unresponsive and soon crashes. Its VM simply shuts down.
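For reference, the copy was done with plain send/receive along these lines (the pool and dataset names here are made up, not the real ones):
Code:
# recursive snapshot of the troubled PBS pool, streamed into a PVE-managed pool
zfs snapshot -r pbspool/backup@migrate
zfs send -R pbspool/backup@migrate | zfs recv -u pvepool/backup-copy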

I have put V3 of PBS back in; I don't know why. It didn't make any difference.

PVE is still running at V7.

In the PVE syslog, this one message appears:
QEMU[2536599]: kvm: ../hw/usb/core.c:563: usb_packet_check_state: Assertion `!"usb packet state check failed"' failed.

After that, PVE proceeds to shut down PBS.

I can't tell if this is hardware or software. PBS's syslog still shows an abrupt reboot message with no indication that anything unusual is happening.

All five disks are passed through to PBS over USB. There are no other USB devices on the machine.

The disks are in a new 5-bay USB dock from Sabrent (https://sabrent.com/products/ds-sc5b). I've reported the problem to Sabrent, thinking I have a bad dock. They want me to take it all apart, swap in different disk drives that I don't have and can't afford, and run it on Windows, which isn't going to happen.

I've had another one of these docks for a year and thought I'd get a second one. Both are populated with five WD Red Plus 10 TB drives. The two docks are identical, except that one is managed by PVE and the other is managed by PBS (under PVE).

I thought the problem was in the USB HBA that I'm using, so I tried plugging the dock into a motherboard USB port. That didn't help. Besides, PVE is using the same HBA without any problem.

Here is information per your prior request. These were taken from the PVE V7 system that I'm currently running.

Code:
root@leghorn:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.126-1-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-7
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
root@leghorn:~#

Code:
root@leghorn:~# dpkg -l | grep libc6~
root@leghorn:~# apt list libc6*
Listing... Done
libc6-amd64-cross/oldstable 2.31-9cross4 all
libc6-amd64-i386-cross/oldstable 2.31-9cross4 all
libc6-amd64-x32-cross/oldstable 2.31-9cross4 all
libc6-arm64-cross/oldstable 2.31-9cross4 all
libc6-armel-cross/oldstable 2.31-9cross4 all
libc6-armhf-cross/oldstable 2.31-9cross4 all
libc6-dbg/oldstable-security 2.31-13+deb11u7 amd64
libc6-dev-amd64-cross/oldstable 2.31-9cross4 all
libc6-dev-amd64-i386-cross/oldstable 2.31-9cross4 all
libc6-dev-amd64-x32-cross/oldstable 2.31-9cross4 all
libc6-dev-arm64-cross/oldstable 2.31-9cross4 all
libc6-dev-armel-cross/oldstable 2.31-9cross4 all
libc6-dev-armhf-cross/oldstable 2.31-9cross4 all
libc6-dev-hppa-cross/oldstable 2.31-9cross4 all
libc6-dev-i386-amd64-cross/oldstable 2.31-9cross4 all
libc6-dev-i386-cross/oldstable 2.31-9cross4 all
libc6-dev-i386-x32-cross/oldstable 2.31-9cross4 all
libc6-dev-i386/oldstable-security 2.31-13+deb11u7 amd64
libc6-dev-m68k-cross/oldstable 2.31-9cross4 all
libc6-dev-mips-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mips64-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mips64el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mips32-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mips-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsel-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64el-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsel-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mips-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mips64-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mips64el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mipsel-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-dev-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-dev-powerpc-cross/oldstable 2.31-9cross4 all
libc6-dev-powerpc-ppc64-cross/oldstable 2.31-9cross4 all
libc6-dev-ppc64-cross/oldstable 2.31-9cross4 all
libc6-dev-ppc64-powerpc-cross/oldstable 2.31-9cross4 all
libc6-dev-ppc64el-cross/oldstable 2.31-9cross4 all
libc6-dev-riscv64-cross/oldstable 2.31-9cross4 all
libc6-dev-s390-s390x-cross/oldstable 2.31-9cross4 all
libc6-dev-s390x-cross/oldstable 2.31-9cross4 all
libc6-dev-sh4-cross/oldstable 2.31-9cross4 all
libc6-dev-sparc-sparc64-cross/oldstable 2.31-9cross4 all
libc6-dev-sparc64-cross/oldstable 2.31-9cross4 all
libc6-dev-x32-amd64-cross/oldstable 2.31-9cross4 all
libc6-dev-x32-cross/oldstable 2.31-9cross4 all
libc6-dev-x32-i386-cross/oldstable 2.31-9cross4 all
libc6-dev-x32/oldstable-security 2.31-13+deb11u7 amd64
libc6-dev/oldstable-security 2.31-13+deb11u7 amd64
libc6-hppa-cross/oldstable 2.31-9cross4 all
libc6-i386-amd64-cross/oldstable 2.31-9cross4 all
libc6-i386-cross/oldstable 2.31-9cross4 all
libc6-i386-x32-cross/oldstable 2.31-9cross4 all
libc6-i386/oldstable-security 2.31-13+deb11u7 amd64
libc6-m68k-cross/oldstable 2.31-9cross4 all
libc6-mips-cross/oldstable 2.31-11cross1 all
libc6-mips32-mips64-cross/oldstable 2.31-11cross1 all
libc6-mips32-mips64el-cross/oldstable 2.31-11cross1 all
libc6-mips32-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-mips32-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-mips32-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-mips32-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-mips32-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-mips32-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-mips64-cross/oldstable 2.31-11cross1 all
libc6-mips64-mips-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsel-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-mips64-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-mips64el-cross/oldstable 2.31-11cross1 all
libc6-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-mipsel-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mips-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mips64-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mips64el-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mips64r6-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mips64r6el-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mipsel-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-mipsn32-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-mipsn32el-cross/oldstable 2.31-11cross1 all
libc6-mipsn32r6-cross/oldstable 2.31-11cross1 all
libc6-mipsn32r6el-cross/oldstable 2.31-11cross1 all
libc6-mipsr6-cross/oldstable 2.31-11cross1 all
libc6-mipsr6el-cross/oldstable 2.31-11cross1 all
libc6-powerpc-cross/oldstable 2.31-9cross4 all
libc6-powerpc-ppc64-cross/oldstable 2.31-9cross4 all
libc6-ppc64-cross/oldstable 2.31-9cross4 all
libc6-ppc64-powerpc-cross/oldstable 2.31-9cross4 all
libc6-ppc64el-cross/oldstable 2.31-9cross4 all
libc6-riscv64-cross/oldstable 2.31-9cross4 all
libc6-s390-s390x-cross/oldstable 2.31-9cross4 all
libc6-s390x-cross/oldstable 2.31-9cross4 all
libc6-sh4-cross/oldstable 2.31-9cross4 all
libc6-sparc-sparc64-cross/oldstable 2.31-9cross4 all
libc6-sparc64-cross/oldstable 2.31-9cross4 all
libc6-x32-amd64-cross/oldstable 2.31-9cross4 all
libc6-x32-cross/oldstable 2.31-9cross4 all
libc6-x32-i386-cross/oldstable 2.31-9cross4 all
libc6-x32/oldstable-security 2.31-13+deb11u7 amd64
libc6.1-alpha-cross/oldstable 2.31-9cross4 all
libc6.1-dev-alpha-cross/oldstable 2.31-9cross4 all
libc6/oldstable-security,now 2.31-13+deb11u7 amd64 [installed]
root@leghorn:~#
 
Hi,
In the PVE syslog, this one message appears:
QEMU[2536599]: kvm: ../hw/usb/core.c:563: usb_packet_check_state: Assertion `!"usb packet state check failed"' failed.
that is an assertion failure in the USB subsystem in QEMU, meaning it encountered an unexpected situation and aborted because of it. Could you share the VM configuration?
I can't tell if this is hardware or software. PBS's syslog still shows an abrupt reboot message with no indication that anything unusual is happening.

All five disks are passed through to PBS over USB. There are no other USB devices on the machine.
It's likely related to the USB passthrough, but the question remains whether it's a QEMU bug or a hardware issue, or whether QEMU could handle the error more gracefully. To get a more complete picture, you can attach GDB to the VM with
Code:
gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /var/run/qemu-server/<put your VMID here>.pid)
(replacing <put your VMID here> with the actual ID) before the crash, and then when the crash happens, type 'thread apply all backtrace' in the GDB session.

EDIT: there also is a tracepoint in QEMU before the assertion, so you can use
Code:
qm set <ID> --args '-trace usb_packet_state_fault,file=/tmp/usb-fault.log'
(needs to be done before starting the VM). QEMU will create a log file /tmp/usb-fault.log which should contain a bit more information about the state after the crash happened.
 
Fiona, thanks for that plan. I'll use it if necessary. I know you're trying your best to debug this.

Before I read your reply, I had begun trying something different. I woke up this morning with the thought that it might not be a good idea to run ZFS in a VM with passed-through disks, so I decided to go the more traditional route. I have:

  1. removed the pass-throughs from the PBS VM.
  2. imported the PBS pool in PVE.
  3. defined a virtual disk in the PBS VM that has almost all the available space shown in zfs list.

Now PVE is managing both of the ZFS pools (roughly as sketched below).
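The commands were of this shape (VM ID, device index, pool and storage names are placeholders, not my actual values):
Code:
# 1. drop a USB passthrough entry from the PBS VM
qm set <VMID> -delete usb0
# 2. import the former PBS pool on the PVE host
zpool import <pbs-pool>
# 3. give the PBS VM a large virtual disk on PVE-managed storage (size in GiB)
qm set <VMID> -scsi1 <storage>:<size>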
I am currently using scp to copy the backed-up data that is in the PVE ZFS pool to the ext4 filesystem (new virtual disk) inside PBS. It's moving along smoothly so far; we are at 472 GB out of a total of 12.3 TB. That's the farthest I've gotten so far. Keeping my fingers crossed.

After this copy, I will run zpool scrub on the PBS pool, and also a full verify of the newly copied datastore on PBS. (Actually, there are two datastores, plus a bunch of data that is not in any datastore.)
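The follow-up checks will look roughly like this (pool and datastore names are placeholders; if I have the CLI right, the verify can also be started from the PBS GUI instead):
Code:
# scrub the pool that now backs the PBS virtual disk
zpool scrub <poolname>
zpool status -v <poolname>      # watch scrub progress and results
# inside PBS, start a verification of the datastore
proxmox-backup-manager verify <datastore>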

If I run into problems with this process, I'll be continuing with the debugging with you.

Thanks again for your ongoing help!
 
Well, after all of that, it turned out that the problem was still happening, but with different symptoms. But I think I've finally found and fixed the problem. I'm giving it more time, in case it comes back after 24 hours, like the last time. I'll be back soon with the details. Power on.
 
