Backup of VM fails: broken pipe

dejhost

Active Member
Dec 13, 2020
I mounted a disk from a PVE host within my LAN, outside my cluster, using sshfs. The disk has roughly 8.3 TB of free space:
Code:
Usage 10.09% (931.79 GB of 9.24 TB)
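For context, the mount was created along these lines (the remote host name and export path are placeholders; only the /mnt/migrate mountpoint is taken from the backup log):

```shell
# Mount a directory from the remote PVE host over SSH via sshfs.
# 'reconnect' re-establishes the session after transient drops;
# 'allow_other' lets processes other than the mounting user write to it.
mkdir -p /mnt/migrate
sshfs root@other-pve:/srv/backup /mnt/migrate -o reconnect,allow_other

# Verify capacity as seen through the mount
df -h /mnt/migrate
```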

I want to back up one of my VMs (about 2.5 TB) onto this disk, but the process fails:

Task viewer:
Code:
VM/CT 110 - Backup Output
INFO: starting new backup job: vzdump 110 --remove 0 --node proxmox03 --compress zstd --notes-template '{{cluster}}, {{guestname}}, {{node}}, {{vmid}}' --mode stop --storage migrate
INFO: Starting Backup of VM 110 (qemu)
INFO: Backup started at 2022-12-30 09:33:38
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: NC-host02
INFO: include disk 'scsi0' 'local-lvm:vm-110-disk-0' 32G
INFO: include disk 'virtio1' 'Raid1:vm-110-disk-0' 2500G
INFO: snapshots found (not included into backup)
INFO: creating vzdump archive '/mnt/migrate/dump/vzdump-qemu-110-2022_12_30-09_33_38.vma.zst'
INFO: starting kvm to execute backup task
INFO: started backup task '93df4bd7-c37f-48d5-b0f2-cffe078fe2d3'
INFO: 0% (235.6 MiB of 2.5 TiB) in 3s, read: 78.5 MiB/s, write: 67.7 MiB/s
INFO: 1% (25.3 GiB of 2.5 TiB) in 6m 6s, read: 70.9 MiB/s, write: 69.5 MiB/s
INFO: 2% (50.6 GiB of 2.5 TiB) in 11m 52s, read: 74.9 MiB/s, write: 73.6 MiB/s
...
INFO: 48% (1.2 TiB of 2.5 TiB) in 3h 58m, read: 102.1 MiB/s, write: 93.6 MiB/s
INFO: 49% (1.2 TiB of 2.5 TiB) in 4h 2m 14s, read: 102.0 MiB/s, write: 94.9 MiB/s
INFO: 50% (1.2 TiB of 2.5 TiB) in 4h 6m 34s, read: 100.4 MiB/s, write: 94.0 MiB/s
zstd: error 25 : Write error : Input/output error (cannot write compressed block)
INFO: 50% (1.2 TiB of 2.5 TiB) in 4h 7m 13s, read: 98.5 MiB/s, write: 95.0 MiB/s
ERROR: vma_queue_write: write error - Broken pipe
INFO: aborting backup job
INFO: stopping kvm after backup task
trying to acquire lock...
 OK
ERROR: Backup of VM 110 failed - vma_queue_write: write error - Broken pipe
INFO: Failed at 2022-12-30 13:41:00
INFO: Backup job finished with error
TASK ERROR: job errors


I repeated the backup task with exactly the same outcome, failing at the same 50%. Last night I started a third attempt, switching to gzip as the compression method. It has not reached the critical 50% mark yet.

Kernel Version: Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3



The most recent /var/log/syslog shows:
Code:
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pg 2.7c not scrubbed since 2022-11-27T08:47:40.416372+0100
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : [WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'Raid1' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'device_health_metrics' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'cephfs_data' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'cephfs_metadata' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : [WRN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : Pool device_health_metrics has 8 placement groups, should have 32
Dec 31 10:48:22 proxmox03 pvedaemon[3164583]: <root@pam> starting task UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam:
Dec 31 10:48:22 proxmox03 pvedaemon[3886698]: starting termproxy UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam:
Dec 31 10:48:23 proxmox03 pvedaemon[3861219]: <root@pam> successful auth for user 'root@pam'
Dec 31 10:48:23 proxmox03 systemd[1]: Started Session 2619 of user root.
Dec 31 10:50:00 proxmox03 ceph-mon[1692505]: 2022-12-31T10:49:59.996+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 10:50:31 proxmox03 systemd[1]: Starting Cleanup of Temporary Directories...
Dec 31 10:50:31 proxmox03 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Dec 31 10:50:31 proxmox03 systemd[1]: Finished Cleanup of Temporary Directories.
Dec 31 10:51:30 proxmox03 pmxcfs[1680221]: [dcdb] notice: data verification successful
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] link: host: 2 link: 0 is down
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] host: host: 2 has no active links
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] rx: host: 2 link: 0 is up
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] pmtud: Global data MTU changed to: 1397
Dec 31 11:00:00 proxmox03 ceph-mon[1692505]: 2022-12-31T10:59:59.999+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:10:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:09:59.995+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:17:01 proxmox03 CRON[3904114]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 31 11:20:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:19:59.994+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:30:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:29:59.993+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:40:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:39:59.993+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:46:23 proxmox03 systemd[1]: session-2619.scope: Succeeded.
Dec 31 11:46:23 proxmox03 pvedaemon[3164583]: <root@pam> end task UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam: OK
Dec 31 11:46:25 proxmox03 pvedaemon[3164583]: <root@pam> successful auth for user 'root@pam'
Dec 31 11:46:26 proxmox03 pvedaemon[3921926]: starting termproxy UPID:proxmox03:003BD806:12928828:63B01302:vncshell::root@pam:
Dec 31 11:46:26 proxmox03 pvedaemon[3164583]: <root@pam> starting task UPID:proxmox03:003BD806:12928828:63B01302:vncshell::root@pam:
Dec 31 11:46:26 proxmox03 pvedaemon[3069312]: <root@pam> successful auth for user 'root@pam'
Dec 31 11:46:26 proxmox03 systemd[1]: Started Session 2622 of user root.

This thread suggests that the local drive needs enough free space to temporarily store the entire VM. If that is the case, I would need a workaround, since I cannot store 2.5 TB on the local drive...

Could you please help me to troubleshoot this?
 
A friend of mine solved this: he realized that my target drive is CephFS, and that its maximum file size was limited to 1 TB. The command
Code:
ceph fs set <fs name> max_file_size <size in bytes>
increased the maximum file size on the target file system, and the backup went through fine afterwards.
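For anyone hitting the same wall: CephFS refuses writes past max_file_size, and that refusal apparently travels back through the sshfs mount as the write error / broken pipe seen above. A minimal sketch of checking and raising the limit (the file system name cephfs and the 3 TiB target are assumptions; pick a value larger than your expected archive):

```shell
# Show the current limit; 'ceph fs get' prints max_file_size in bytes
ceph fs get cephfs | grep max_file_size

# Compute 3 TiB in bytes with shell arithmetic
NEW_LIMIT=$((3 * 1024 * 1024 * 1024 * 1024))
echo "$NEW_LIMIT"   # 3298534883328

# Raise the cap; this changes only the limit, no data is touched
ceph fs set cephfs max_file_size "$NEW_LIMIT"
```

The setting is a cap, not a reservation, so raising it generously does not consume any space.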

Hope this helps somebody else.