Weird ZFS & backup interactions

Nov 1, 2023
Not sure where to begin...

I have a Dell XPS 8930 (i7, 64GB RAM).

I added a QNAP TL-D800S JBOD with a QXP800eS-A1164 PCIe card.

I set up a 3x8TB ZFS pool (all three drives are brand new) from the QNAP, mounted as /storage, directly on Proxmox (not in a VM or LXC).

This is extra storage, I don't boot off of this pool or anything like that.

I set up vsftpd on Proxmox and FTP'd about 5TB of media files over to /storage (actually into a dataset called /storage/media). I set up Plex in an LXC and used bind mounts from Proxmox into the Plex LXC. Everything works fine: Plex works, the files are fine, rainbows and unicorns.

Now that I've got the beginning of something working, I decided to start doing backups.

I have two LXC's and one VM.

When I back up the LXCs, no problem.

When I back up the VM, my ZFS pool gets corrupted and suspended (every time) with write or checksum errors. The VM isn't running.

What's weird, though, is that I'm not backing up to the local zpool (/storage); I'm backing up to an NFS share on my Synology.

The backup always succeeds and completes, but my zpool gets corrupted every time. Only when backing up the one VM, never when backing up an LXC.

I have to reboot the machine to recover, but every time it reboots, the system comes up fine; the zpool is fine and everything works.

What could explain my zpool getting messed up when I'm not writing to it? More importantly, how do I fix it?

Happy to provide logs, just not sure which ones would help.
 
Hi,
so you say your zpool gets corrupted by the VM backup reading from the pool while writing to a different storage? That does sound very strange. Please share:

  • the systemd journal since boot: journalctl -b > journal.txt
  • your storage config: cat /etc/pve/storage.cfg
  • the VM config of one of the VMs that corrupts the pool: qm config <VMID>
  • the zpool status
  • and last but not least, pveversion -v

I would also recommend running a prolonged memory test to check that your RAM is fine; many such errors come down to bad RAM, with ZFS complaining about checksum mismatches.
 
Thanks Chris,

Logs attached.

One point to clarify.

The zpool that is getting corrupted should NOT be in the path of the VM backup I'm running. The VM doesn't know about the zpool, it's on proxmox proper (LVM), and the backup I'm running is essentially from local storage (SSD) to an NFS mount.
 

Attachments

  • logfiles.tar (350 KB)
Thank you for the logs; this is indeed a rather strange issue you're encountering. So you say this is reproducible every time you start a VM backup?

The logs indeed show that right after pvedaemon starts the backup, the Linux kernel disables a disk (ata31.00: disable device) and then another one (ata32.00: disable device), leading to the zpool's I/O being suspended because it no longer has enough disks.

Please also provide the output of lsblk -o +FSTYPE,LABEL and ls -la /dev/disk/by-id/.
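For anyone following along, a filter roughly like the one below is how such detach events can be pulled out of the boot journal. The journal lines here are simulated for illustration (timestamps, PIDs, and hostnames are made up); on a live host you would pipe journalctl -b instead of the sample function:

```shell
# Simulated journal excerpt -- illustrative only. On a real host,
# replace journal_sample with `journalctl -b`.
journal_sample() {
cat <<'EOF'
Nov 01 20:15:02 pve pvedaemon[1234]: INFO: starting new backup job
Nov 01 20:15:05 pve kernel: ata31.00: disable device
Nov 01 20:15:05 pve kernel: ata32.00: disable device
Nov 01 20:15:06 pve zed[5678]: eid=42 class=statechange pool='storage'
EOF
}

# Pull out the ATA detach events that precede the pool suspension.
journal_sample | grep -E 'ata[0-9]+\.[0-9]+: disable device'
```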
 
Thanks Chris, been busy.

I ran a full memtest with a PASS on everything, no errors.

I've also upgraded the Intel microcode and flashed the latest BIOS from Dell.

I don't have a different PCIe slot for the QXP card, unfortunately, but again, I was able to transfer 5TB of stuff to it without issue.

Items requested are attached.

Appreciate the help!

Frank.

P.S. Yes, getting my Proxmox SMTP/postfix stuff sorted is next on my list after this issue is resolved.
 

Attachments

  • lsblk.txt (2.3 KB)
  • by-id.txt (4 KB)
Is the issue always the same? That is, when the VM backup job starts, the zpool disks detach, with ata31.00: disable device logged in the systemd journal? This indicates that the disks disappear right after the backup starts, but since the storage is not related to the backup job at all, there has to be some other correlation (power consumption?). You might want to try an older kernel to see if the issue persists.
 
Yeah, same error:

kernel: ata31.00: disable device

right after the backup starts.


I have the following kernels; is one of these "old" enough to test against by downgrading?

# proxmox-boot-tool kernel list
Manually selected kernels:
None.

Automatically selected kernels:
6.2.16-15-pve
6.2.16-18-pve
 
Okay then, I feel like a bit of an idiot, but I've solved the issue.

The VM I was trying to back up had two PCI devices passed through.

My QXP device has two ports, and sometime in the past I was trying to pass those PCIe ports through to the VM in question.

While I haven't started this VM since before this problem occurred, once I removed those two pass-throughs the backup went fine.

Obviously those pass-throughs were connected to the zpool, since the ports I was mapping were on the PCIe card with the SATA connectors.

To my mind the backup should not have read or written to those devices, since the VM was off and there was nothing there but raw hardware (nothing on the VM used those pass-throughs), but apparently the Proxmox backup was fouling the mapped devices, and therefore the zpool on the other end of those two ports, either by trying to write to them or otherwise messing them up...

Kinda feels like a bug, but I'm new to Proxmox, so maybe I missed something in the manual.

I do appreciate all the help. Process of elimination, one step at a time...

Thanks,
Frank.
 

Attachments

  • Screenshot 2023-11-13 172002.png (6.1 KB)
Okay, well this explains the behavior. By passing through a PCI device to the VM, the host gives up its control over it, thereby dropping your disks. This is not a bug but rather how PCI passthrough works.

Glad you were able to solve this!
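For future readers: one way to spot this configuration up front is to look for hostpciN lines in the VM config, since any such line means starting the VM (or running a stop-mode backup of it) will detach that device from the host. The config below is simulated; the VMID and PCI addresses are made up, and on a real host you would run qm config <VMID> instead:

```shell
# Simulated `qm config` output -- device addresses are illustrative.
vm_config() {
cat <<'EOF'
boot: order=scsi0
cores: 4
hostpci0: 0000:03:00.0
hostpci1: 0000:03:00.1
memory: 8192
scsi0: local-lvm:vm-100-disk-0,size=64G
EOF
}

# Any hostpciN lines mean the host will hand these devices to the VM.
vm_config | grep '^hostpci'
```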
 
Thanks Chris,
Could you explain that better just so I understand for the future?

I passed through two disk controllers to the VM. Those devices exist in the VM, but they weren't, in this case, used. Even if they were used, I'm not clear on why doing a "stopped" backup would write to those devices or otherwise cause them to corrupt the filesystems on the other side of the interface.

Really appreciate the help and support!

Thanks,
Frank.
 
VM backups in stop mode will start the QEMU process for that VM in order to write the backup, so while the QEMU process will not touch your devices, the host will disconnect the drives attached to your PCI card. The same would have happened if you had just started the VM itself.
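The hand-over is visible in which kernel driver claims the controller: vfio-pci instead of the native ahci driver means the host has given the device up, and its disks are gone from the host's view. The lspci -k output below is simulated (the controller address, model, and subsystem are assumptions for illustration):

```shell
# Simulated `lspci -k` output for a SATA controller that has been
# handed to a VM; address and model are illustrative.
lspci_sample() {
cat <<'EOF'
03:00.0 SATA controller: ASMedia Technology Inc. Device 1164
	Subsystem: QNAP Systems, Inc. Device 0001
	Kernel driver in use: vfio-pci
	Kernel modules: ahci
EOF
}

# ahci here would mean the host still owns the controller; vfio-pci
# means it has been claimed for passthrough.
lspci_sample | grep 'Kernel driver in use'
```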
 
I second that. vzdump should know to ignore PCI passthrough devices. @fmgoodman it would be a good idea to file it at bugzilla.proxmox.com
You mean that the QEMU process for stopped backups should ignore PCI passed-through devices? Well, that might work, but as soon as the VM is started regularly the same thing will occur, so the disks would be disconnected anyway. And keeping the VM stopped with the PCI passthrough doesn't seem that useful.
 
I mean, once the snapshot has been made of the local file system, the backup process should not impact the operation of the virtual machine regardless of configuration. If this is not possible (if a VM cannot be frozen for backup) then it should be so documented.
 
I am not sure I get your point here. The VM in question had PCI passthrough configured for the card connected to the drives which were part of the zpool on the host. So as soon as the QEMU process started, the Linux kernel detached the drives and passed the PCI card through to QEMU; the pool gets degraded because it loses the disks, not because the VM writes to the disks. As stated, this would also have happened if the VM had been started up normally.
 
Holy moly; I didn't get that from the OP's initial description. In this case the issue has nothing to do with backups at all. Please ignore all my posts in this thread ;)
 
