Backup hangs on first VM and cannot be stopped/killed

gargravarr · New Member · Jan 22, 2024
Hi folks,

My cluster is a set of 4 HP 260 G1s (i3-4030U, 16GB, 240GB boot SSD) connected to a TrueNAS Scale system providing shared iSCSI LUNs on both SSDs and HDDs for VHD storage. Originally I had each node backing up to a separate NAS via NFS.

I've now removed that NAS as it was giving me some issues (suspected hardware fault). However, I haven't been able to run a complete backup since. On each node, the backup starts but hangs a variable percentage of the way into the first VM. Logging into the hypervisor, I can see the process is in state D (uninterruptible sleep, i.e. waiting on disk), making no further progress and using no CPU. I cannot kill the process; the only way to stop the backup is to reboot the host, otherwise the system load just climbs, and on occasion it has forced the host to reboot. The I/O backlog slows everything down to the point that even an ls of my home folder on the host hangs.
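For what it's worth, this is how I've been checking the stuck process and the kernel log on the host (standard procps/util-linux commands, nothing Proxmox-specific):

```shell
# Show the ps header plus any processes in uninterruptible sleep (state D),
# including the kernel function they are blocked in (wchan column)
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Kernel-side evidence of a stall: hung-task warnings and NFS timeouts
# (dmesg may need root; grep finding nothing just prints no lines)
dmesg -T 2>/dev/null | grep -Ei 'blocked for more than|hung_task|not responding' | tail -n 20
```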

I first moved the NFS share from the original NAS onto my TrueNAS machine temporarily, and that's when the issue started. I figured it must be having trouble reading from and writing to the same machine, so I've since attached a USB drive to my router and configured an NFS share there. It mounts and the hosts can write to it, but the same problem occurs.
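In case it's relevant, the NFS storage is defined roughly like this in /etc/pve/storage.cfg (the storage name, IP, and paths here are placeholders, and I've left the mount options at the defaults, i.e. hard mounts):

```
nfs: usb-backup
        server 192.168.1.1
        export /mnt/usbdrive
        path /mnt/pve/usb-backup
        content backup
```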

I can't find any error messages on the host indicating a problem, just that the backup gets to a point and then stops. I initially thought it might be filling up the root disk with large VHDs, so I carved a 100GB LV out of the system SSD, mounted it at /dump, and set dumpdir: /dump in vzdump.conf on each node. Curiously, I don't see any disk space being used anywhere during the backup: nothing on the LV, and only an empty file created on the NFS share, yet the logs indicate the backup reads up to 20GB of the VHD before stalling. Pretty confusing symptoms. I also enabled VM fleecing while looking through the config, but it hasn't made any difference.
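For reference, the relevant lines of my /etc/vzdump.conf on each node (the fleecing storage name here is just illustrative):

```
dumpdir: /dump
fleecing: enabled=1,storage=local-lvm
```

As far as I understand, a backup job's own storage setting overrides the dumpdir default from vzdump.conf, so if the job targets the NFS storage directly, that might be why /dump stays empty.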

As a result, my VM backups are weeks old by now and I really need to back them up somehow, but with this happening on all 4 nodes, there seems to be something fundamentally broken in my setup. I don't think I did anything special with the NFS config, just set no_root_squash so root on the hosts can write to it.
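The export on the NFS server side looks roughly like this (path and subnet are placeholders):

```
/mnt/usbdrive 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
```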

PVE 8.2, with kernels and packages all updated as of today; the new kernel has made no difference.

TIA!
 
