Backup of VM fails: broken pipe

dejhost

Active Member
Dec 13, 2020
I mounted a disk from a PVE host within my LAN, outside my cluster, using sshfs. The disk has roughly 8.3 TB of free space:
Code:
Usage 10.09% (931.79 GB of 9.24 TB)
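For context, the mount was created along these lines (the remote host name and export path are placeholders; only the /mnt/migrate mountpoint is taken from the backup log):

```shell
# Mount a directory from the remote PVE host over SSH via sshfs.
# 'reconnect' re-establishes the session after transient drops;
# 'allow_other' lets processes other than the mounting user write to it.
mkdir -p /mnt/migrate
sshfs root@other-pve:/srv/backup /mnt/migrate -o reconnect,allow_other

# Verify capacity as seen through the mount
df -h /mnt/migrate
```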

I want to back up one of my VMs (about 2.5 TB) onto this disk, but the process fails:

Task viewer:
Code:
VM/CT 110 - Backup Output
INFO: starting new backup job: vzdump 110 --remove 0 --node proxmox03 --compress zstd --notes-template '{{cluster}}, {{guestname}}, {{node}}, {{vmid}}' --mode stop --storage migrate
INFO: Starting Backup of VM 110 (qemu)
INFO: Backup started at 2022-12-30 09:33:38
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: NC-host02
INFO: include disk 'scsi0' 'local-lvm:vm-110-disk-0' 32G
INFO: include disk 'virtio1' 'Raid1:vm-110-disk-0' 2500G
INFO: snapshots found (not included into backup)
INFO: creating vzdump archive '/mnt/migrate/dump/vzdump-qemu-110-2022_12_30-09_33_38.vma.zst'
INFO: starting kvm to execute backup task
INFO: started backup task '93df4bd7-c37f-48d5-b0f2-cffe078fe2d3'
INFO: 0% (235.6 MiB of 2.5 TiB) in 3s, read: 78.5 MiB/s, write: 67.7 MiB/s
INFO: 1% (25.3 GiB of 2.5 TiB) in 6m 6s, read: 70.9 MiB/s, write: 69.5 MiB/s
INFO: 2% (50.6 GiB of 2.5 TiB) in 11m 52s, read: 74.9 MiB/s, write: 73.6 MiB/s
...
INFO: 48% (1.2 TiB of 2.5 TiB) in 3h 58m, read: 102.1 MiB/s, write: 93.6 MiB/s
INFO: 49% (1.2 TiB of 2.5 TiB) in 4h 2m 14s, read: 102.0 MiB/s, write: 94.9 MiB/s
INFO: 50% (1.2 TiB of 2.5 TiB) in 4h 6m 34s, read: 100.4 MiB/s, write: 94.0 MiB/s
zstd: error 25 : Write error : Input/output error (cannot write compressed block)
INFO: 50% (1.2 TiB of 2.5 TiB) in 4h 7m 13s, read: 98.5 MiB/s, write: 95.0 MiB/s
ERROR: vma_queue_write: write error - Broken pipe
INFO: aborting backup job
INFO: stopping kvm after backup task
trying to acquire lock...
 OK
ERROR: Backup of VM 110 failed - vma_queue_write: write error - Broken pipe
INFO: Failed at 2022-12-30 13:41:00
INFO: Backup job finished with error
TASK ERROR: job errors


I repeated the backup task with exactly the same outcome, failing at the same 50%. Last night I started a third attempt, switching to gzip as the compression method. It has not reached the critical 50% mark yet.

Kernel Version: Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3



The most recent /var/log/syslog shows:
Code:
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pg 2.7c not scrubbed since 2022-11-27T08:47:40.416372+0100
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : [WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'Raid1' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'device_health_metrics' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'cephfs_data' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : pool 'cephfs_metadata' is backfillfull
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : [WRN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Dec 31 10:42:54 proxmox03 ceph-mon[1692505]: 2022-12-31T10:42:54.834+0100 7fb86e7a5700 -1 log_channel(cluster) log [ERR] : Pool device_health_metrics has 8 placement groups, should have 32
Dec 31 10:48:22 proxmox03 pvedaemon[3164583]: <root@pam> starting task UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam:
Dec 31 10:48:22 proxmox03 pvedaemon[3886698]: starting termproxy UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam:
Dec 31 10:48:23 proxmox03 pvedaemon[3861219]: <root@pam> successful auth for user 'root@pam'
Dec 31 10:48:23 proxmox03 systemd[1]: Started Session 2619 of user root.
Dec 31 10:50:00 proxmox03 ceph-mon[1692505]: 2022-12-31T10:49:59.996+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 10:50:31 proxmox03 systemd[1]: Starting Cleanup of Temporary Directories...
Dec 31 10:50:31 proxmox03 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Dec 31 10:50:31 proxmox03 systemd[1]: Finished Cleanup of Temporary Directories.
Dec 31 10:51:30 proxmox03 pmxcfs[1680221]: [dcdb] notice: data verification successful
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] link: host: 2 link: 0 is down
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 31 10:58:10 proxmox03 corosync[1680017]: [KNET ] host: host: 2 has no active links
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] rx: host: 2 link: 0 is up
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 31 10:58:11 proxmox03 corosync[1680017]: [KNET ] pmtud: Global data MTU changed to: 1397
Dec 31 11:00:00 proxmox03 ceph-mon[1692505]: 2022-12-31T10:59:59.999+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:10:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:09:59.995+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:17:01 proxmox03 CRON[3904114]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 31 11:20:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:19:59.994+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:30:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:29:59.993+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:40:00 proxmox03 ceph-mon[1692505]: 2022-12-31T11:39:59.993+0100 7fb8727ad700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR mons are allowing insecure global_id reclaim; Module 'devicehealth' has failed: ; mon proxmox03 is low on available space; 1 backfillfull osd(s); 3 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 48 pgs backfill_toofull; 48 pgs not deep-scrubbed in time; 48 pgs not scrubbed in time; 4 pool(s) backfillfull; 1 pools have too few placement groups
Dec 31 11:46:23 proxmox03 systemd[1]: session-2619.scope: Succeeded.
Dec 31 11:46:23 proxmox03 pvedaemon[3164583]: <root@pam> end task UPID:proxmox03:003B4E6A:128D3757:63B00566:vncshell::root@pam: OK
Dec 31 11:46:25 proxmox03 pvedaemon[3164583]: <root@pam> successful auth for user 'root@pam'
Dec 31 11:46:26 proxmox03 pvedaemon[3921926]: starting termproxy UPID:proxmox03:003BD806:12928828:63B01302:vncshell::root@pam:
Dec 31 11:46:26 proxmox03 pvedaemon[3164583]: <root@pam> starting task UPID:proxmox03:003BD806:12928828:63B01302:vncshell::root@pam:
Dec 31 11:46:26 proxmox03 pvedaemon[3069312]: <root@pam> successful auth for user 'root@pam'
Dec 31 11:46:26 proxmox03 systemd[1]: Started Session 2622 of user root.

This thread suggests that the local drive needs enough free space to temporarily store the entire VM. If that is the case, I would need a workaround, since I cannot store 2.5 TB on the local drive...

Could you please help me to troubleshoot this?
 
A friend of mine solved this: he realized that my target drive is CephFS, and that its maximum file size was limited to 1 TB. The command
Code:
ceph fs set <fs name> max_file_size <size in bytes>
increased the maximum file size on the target file system, and the backup went through fine afterwards.
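For anyone hitting the same wall: CephFS refuses writes past max_file_size, and that refusal apparently travels back through the sshfs mount as the write error / broken pipe seen above. A minimal sketch of checking and raising the limit (the file system name cephfs and the 3 TiB target are assumptions; pick a value larger than your expected archive):

```shell
# Show the current limit; 'ceph fs get' prints max_file_size in bytes
ceph fs get cephfs | grep max_file_size

# Compute 3 TiB in bytes with shell arithmetic
NEW_LIMIT=$((3 * 1024 * 1024 * 1024 * 1024))
echo "$NEW_LIMIT"   # 3298534883328

# Raise the cap; this changes only the limit, no data is touched
ceph fs set cephfs max_file_size "$NEW_LIMIT"
```

The setting is a cap, not a reservation, so raising it generously does not consume any space.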

Hope this helps somebody else.