smartctl was not showing any errors; only zpool status did (read and write).
The above errors looked suspicious, so I cleared them and let the drive resilver back into use. Then I wrote 74T to the pool and scrubbed it. Still no errors.
So I've now gone almost a week without any errors at all...
I took a working FreeNAS system and reformatted it for Proxmox Backup Server. It contains 36 4T Seagate SAS drives and has been in use for almost three years.
After I started using it, it started getting errors on the drives, and failing them out of the zpool. Recently, it was failing a drive...
I thought that setting a BW limit of 1G on the backup server helped, but it's still doing it after another day or two.
Oddly, when I back up to an NFS mount (NetApp) I get no errors. The same VMs backed up to a Proxmox Backup Server get timeouts.
Both the NFS and the Proxmox backup server have...
The cluster nodes each have dual 40g links to dual switches, in a trunk. The backup server has dual 10g links, so it could be buried by 12 high-end nodes backing up simultaneously.
Ceph runs on a VLAN on the same trunk as the backup traffic, which is on another VLAN.
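To put a rough number on how badly the backup server could be outrun, here is a small worst-case sketch. It assumes both links on every node and on the backup server are active and aggregated, and that all 12 nodes send at line rate at the same time, which real backups almost certainly will not do. The function name is just for illustration.

```python
def oversubscription(nodes, links_per_node, node_link_gbps,
                     server_links, server_link_gbps):
    """Return (offered Gb/s, server capacity Gb/s, ratio) for a worst case.

    Assumes every link is active and every node sends at line rate.
    """
    offered = nodes * links_per_node * node_link_gbps
    capacity = server_links * server_link_gbps
    return offered, capacity, offered / capacity


# Figures from the post: 12 nodes with dual 40g, backup server with dual 10g.
offered, capacity, ratio = oversubscription(12, 2, 40, 2, 10)
print(f"{offered} Gb/s offered vs {capacity} Gb/s capacity ({ratio:.0f}:1)")
```

Even if the real per-node rate is a small fraction of line rate, the ratio suggests the server's uplinks are the first place to look for congestion.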
Any tips on how to slow down...
It looks like all nodes back up simultaneously. Is there any way to spread out the backups, maybe have the nodes go sequentially?
It's not a race; I don't care how long it takes, as long as it's less than a few hours.
I've not found how to increase the timeout. This is becoming very concerning though. Almost every night I have a few VMs that fail to back up.
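One way to picture "spreading out" the backups is to give each node its own start time. The sketch below only computes such a schedule; it assumes one separate backup job per node, and the node names, start time, and 15-minute gap are all made up for illustration. It does not talk to Proxmox at all.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start, gap_minutes):
    """Give each node its own start time, gap_minutes apart, in list order."""
    return {
        node: first_start + timedelta(minutes=gap_minutes * i)
        for i, node in enumerate(nodes)
    }


# Hypothetical node names and times, for illustration only.
nodes = [f"node{i:02d}" for i in range(1, 13)]
schedule = staggered_starts(nodes, datetime(2021, 11, 20, 1, 0), gap_minutes=15)
for node, start in schedule.items():
    print(node, start.strftime("%H:%M"))
```

A fixed gap is not truly sequential: if one node's backup runs longer than the gap, it will overlap with the next node's. It only lowers how many nodes are likely to be writing at once.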
420 VM 420 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/420.conf.tmp.729468' - Device or resource busy
Sometimes the VM is shut down, so there is nothing in the logs. Had one failure this weekend of a VM that has been off for a week.
6 failures last night. 17 failures out of 2,977 backups so far.
Worth noting, I'm backing up to an NFS mount twice a day too (offset by six hours), and no failures occurred on...
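For scale, the failure rate behind those counts can be worked out with a couple of lines. This is just arithmetic on the numbers in the post; the function name is illustrative.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the percentage of backups that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


rate = failure_rate(17, 2977)
print(f"{rate:.2f}% of backups failed")
```

That is a little over half a percent, which is small overall but still adds up to a few failed VMs on most nights.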
Just got a new failure:
118: 2021-11-19 12:32:06 INFO: Starting Backup of VM 118 (qemu)
118: 2021-11-19 12:32:06 INFO: status = running
118: 2021-11-19 12:32:06 INFO: VM Name: spk-ubuntu-test2
118: 2021-11-19 12:32:06 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-118-disk-0' 32G
My test bed backs up four times a day: twice to an NFS mount and twice to the Proxmox Backup Server.
No additional failures since the above, no network or other changes since then either.
This was not the first time backups have failed. Out of 2713 backups to the Proxmox Backup Server, I have 15...
I have a cluster backing up to a dedicated Proxmox Backup Server, which normally works great.
Out of 54 VMs, three, all from the same node, failed to back up last night:
313: 2021-11-15 01:04:04 INFO: Starting Backup of VM 313 (qemu)
313: 2021-11-15 01:04:04 INFO: status = running
I've read that as well, and it sounds like good advice, but I couldn't get it to work. I tried to remove/replace a drive by ID, and it wouldn't take it, but maybe that's because I didn't create the pool that way.
Managing 36 disk-name IDs on one command line when they are each 22+ chars would be...