Backup failing randomly every night

Boman

New Member
Dec 13, 2020
4
0
1
45
Hi all,

I have a two node cluster running on Hetzner and use their Storage-boxes as a backup target. The storage box is a SMB/CIFS-mount configured with default settings via the GUI. The mountpoint (as config:ed in the GUI) looks like this

//u123456.your-storagebox.de/u123456-sub1 on /mnt/pve/backup type cifs (rw,relatime,vers=3.1.1,cache=strict,username=u123456-sub1,uid=0,noforceuid,gid=0,noforcegid,addr=159.69.63.161,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)

The storage on the host-nodes are NVME-backed ZFS-pools. I have ~20 VM:s running (or stopped) on the cluster. When doing nightly backup the backup fails random for the different VM:s (i.e. it's not the same WMs that are failing every night). The log from a failing backup job does not give to mush clues...

Code:
2022-02-27 01:07:54 INFO: Starting Backup of VM 108 (qemu)
2022-02-27 01:07:54 INFO: status = stopped
2022-02-27 01:07:54 INFO: backup mode: stop
2022-02-27 01:07:54 INFO: ionice priority: 7
2022-02-27 01:07:54 INFO: VM Name: openhab
2022-02-27 01:07:54 INFO: include disk 'scsi0' 'local-zfs:vm-108-disk-0' 20G
2022-02-27 01:07:55 INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-108-2022_02_27-01_07_54.vma.zst'
2022-02-27 01:07:55 INFO: starting kvm to execute backup task
2022-02-27 01:07:56 INFO: started backup task 'c0ec535e-c6b5-4920-ab2c-1e4db281425e'
2022-02-27 01:07:59 INFO:   6% (1.2 GiB of 20.0 GiB) in 3s, read: 413.2 MiB/s, write: 255.4 MiB/s
... more percentage... 
2022-02-27 01:08:37 INFO: 100% (20.0 GiB of 20.0 GiB) in 41s, read: 1.8 GiB/s, write: 4.0 KiB/s
2022-02-27 01:08:37 INFO: backup is sparse: 12.92 GiB (64%) total zero data
2022-02-27 01:08:37 INFO: transferred 20.00 GiB in 41 seconds (499.5 MiB/s)
2022-02-27 01:08:54 INFO: stopping kvm after backup task
2022-02-27 01:08:56 ERROR: Backup of VM 108 failed - zstd --rsyncable --threads=1 failed - wrong exit status 1

The syslog around the same time show some suspicious errors:
Code:
Feb 27 01:07:54 eisenhower vzdump[4111356]: INFO: Starting Backup of VM 108 (qemu)
Feb 27 01:07:55 eisenhower systemd[1]: Started 108.scope.
Feb 27 01:07:55 eisenhower systemd-udevd[296231]: Using default interface naming scheme 'v247'.
Feb 27 01:07:55 eisenhower systemd-udevd[296231]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 27 01:07:55 eisenhower kernel: [2118250.748652] device tap108i0 entered promiscuous mode
Feb 27 01:07:55 eisenhower systemd-udevd[296234]: Using default interface naming scheme 'v247'.
Feb 27 01:07:55 eisenhower systemd-udevd[296234]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 27 01:07:55 eisenhower systemd-udevd[296234]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 27 01:07:55 eisenhower systemd-udevd[296231]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 27 01:07:55 eisenhower kernel: [2118250.769625] fwbr108i0: port 1(fwln108i0) entered blocking state
...some more unreleated network ...
Feb 27 01:07:55 eisenhower kernel: [2118250.783495] fwbr108i0: port 2(tap108i0) entered forwarding state
Feb 27 01:07:59 eisenhower kernel: [2118254.951309] warn_alloc: 100 callbacks suppressed
Feb 27 01:07:59 eisenhower kernel: [2118254.951312] kworker/u64:4: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Feb 27 01:07:59 eisenhower kernel: [2118254.951320] CPU: 10 PID: 4094162 Comm: kworker/u64:4 Tainted: P           O      5.13.19-3-pve #1
Feb 27 01:07:59 eisenhower kernel: [2118254.951322] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 2424 09/29/2021
Feb 27 01:07:59 eisenhower kernel: [2118254.951324] Workqueue: writeback wb_workfn (flush-cifs-1)
Feb 27 01:07:59 eisenhower kernel: [2118254.951329] Call Trace:
Feb 27 01:07:59 eisenhower kernel: [2118254.951331]  <TASK>
Feb 27 01:07:59 eisenhower kernel: [2118254.951333]  dump_stack+0x7d/0x9c
Feb 27 01:07:59 eisenhower kernel: [2118254.951337]  warn_alloc.cold+0x7b/0xdf
Feb 27 01:07:59 eisenhower kernel: [2118254.951340]  __alloc_pages_slowpath.constprop.0+0xd57/0xd80
Feb 27 01:07:59 eisenhower kernel: [2118254.951344]  __alloc_pages+0x30e/0x330
Feb 27 01:07:59 eisenhower kernel: [2118254.951345]  alloc_pages+0x87/0x110
Feb 27 01:07:59 eisenhower kernel: [2118254.951348]  kmalloc_order+0x2d/0xb0
Feb 27 01:07:59 eisenhower kernel: [2118254.951351]  kmalloc_order_trace+0x1d/0x80
Feb 27 01:07:59 eisenhower kernel: [2118254.951353]  __kmalloc+0x22b/0x260
Feb 27 01:07:59 eisenhower kernel: [2118254.951356]  crypt_message+0x3a4/0x870 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951387]  ? _crng_backtrack_protect+0x56/0xa0
Feb 27 01:07:59 eisenhower kernel: [2118254.951391]  smb3_init_transform_rq+0x26c/0x2f0 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951416]  smb_send_rqst+0xe9/0x1a0 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951441]  cifs_call_async+0x125/0x280 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951464]  ? alloc_path_with_tree_prefix+0x170/0x170 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951489]  smb2_async_writev+0x2ae/0x5c0 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951515]  ? validate_t2+0x70/0x70 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951538]  cifs_writepages+0x86d/0xc70 [cifs]
Feb 27 01:07:59 eisenhower kernel: [2118254.951563]  do_writepages+0x38/0xc0
Feb 27 01:07:59 eisenhower kernel: [2118254.951566]  ? fprop_reflect_period_percpu.isra.0+0x8f/0xc0
Feb 27 01:07:59 eisenhower kernel: [2118254.951568]  __writeback_single_inode+0x44/0x2a0
Feb 27 01:07:59 eisenhower kernel: [2118254.951569]  writeback_sb_inodes+0x223/0x4d0
Feb 27 01:07:59 eisenhower kernel: [2118254.951571]  __writeback_inodes_wb+0x56/0xf0
Feb 27 01:07:59 eisenhower kernel: [2118254.951572]  wb_writeback+0x1dd/0x290
Feb 27 01:07:59 eisenhower kernel: [2118254.951574]  wb_workfn+0x3a7/0x500
Feb 27 01:07:59 eisenhower kernel: [2118254.951575]  ? ttwu_do_activate+0x6e/0xc0
Feb 27 01:07:59 eisenhower kernel: [2118254.951578]  process_one_work+0x220/0x3c0
Feb 27 01:07:59 eisenhower kernel: [2118254.951579]  worker_thread+0x53/0x420
Feb 27 01:07:59 eisenhower kernel: [2118254.951580]  ? process_one_work+0x3c0/0x3c0
Feb 27 01:07:59 eisenhower kernel: [2118254.951581]  kthread+0x12b/0x150
Feb 27 01:07:59 eisenhower kernel: [2118254.951583]  ? set_kthread_struct+0x50/0x50
Feb 27 01:07:59 eisenhower kernel: [2118254.951585]  ret_from_fork+0x22/0x30
Feb 27 01:07:59 eisenhower kernel: [2118254.951588]  </TASK>
Feb 27 01:07:59 eisenhower kernel: [2118254.951588] Mem-Info:
Feb 27 01:07:59 eisenhower kernel: [2118254.951590] active_anon:4444050 inactive_anon:1274168 isolated_anon:0
Feb 27 01:07:59 eisenhower kernel: [2118254.951590]  active_file:406668 inactive_file:1330440 isolated_file:0
Feb 27 01:07:59 eisenhower kernel: [2118254.951590]  unevictable:37770 dirty:93475 writeback:2184
Feb 27 01:07:59 eisenhower kernel: [2118254.951590]  slab_reclaimable:168124 slab_unreclaimable:1281967
Feb 27 01:07:59 eisenhower kernel: [2118254.951590]  mapped:40039 shmem:16601 pagetables:19665 bounce:0
Feb 27 01:07:59 eisenhower kernel: [2118254.951590]  free:113656 free_pcp:124 free_cma:0
Feb 27 01:07:59 eisenhower kernel: [2118254.951593] Node 0 active_anon:17776200kB inactive_anon:5096672kB active_file:1626672kB inactive_file:5321760kB unevictable:151080kB isolated(anon):0kB isolated(file):0kB mapped:160156kB dirty:373900kB writeback:8736kB shmem:66404kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2691072kB writeback_tmp:0kB kernel_stack:10532kB pagetables:78660kB all_unreclaimable? no
Feb 27 01:07:59 eisenhower kernel: [2118254.951596] Node 0 DMA free:11260kB min:12kB low:24kB high:36kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951599] lowmem_reserve[]: 0 3120 64146 64146 64146
Feb 27 01:07:59 eisenhower kernel: [2118254.951601] Node 0 DMA32 free:247488kB min:3284kB low:6476kB high:9668kB reserved_highatomic:2048KB active_anon:947160kB inactive_anon:289640kB active_file:22476kB inactive_file:180952kB unevictable:0kB writepending:40828kB present:3314276kB managed:3248740kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951604] lowmem_reserve[]: 0 0 61026 61026 61026
Feb 27 01:07:59 eisenhower kernel: [2118254.951606] Node 0 Normal free:195876kB min:64280kB low:126768kB high:189256kB reserved_highatomic:2048KB active_anon:16829040kB inactive_anon:4805592kB active_file:1604196kB inactive_file:5140968kB unevictable:151080kB writepending:341820kB present:63688192kB managed:62498132kB mlocked:151040kB bounce:0kB free_pcp:448kB local_pcp:0kB free_cma:0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951609] lowmem_reserve[]: 0 0 0 0 0
Feb 27 01:07:59 eisenhower kernel: [2118254.951610] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 2*4096kB (M) = 11260kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951618] Node 0 DMA32: 2742*4kB (UME) 2886*8kB (UME) 2096*16kB (UME) 153*32kB (UM) 1807*64kB (UM) 365*128kB (UM) 48*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 247144kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951625] Node 0 Normal: 10883*4kB (UMH) 9711*8kB (UMEH) 4777*16kB (UE) 4*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 197780kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951632] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951633] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951634] 1758045 total pagecache pages
Feb 27 01:07:59 eisenhower kernel: [2118254.951634] 0 pages in swap cache
Feb 27 01:07:59 eisenhower kernel: [2118254.951635] Swap cache stats: add 0, delete 0, find 0/0
Feb 27 01:07:59 eisenhower kernel: [2118254.951636] Free swap  = 0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951636] Total swap = 0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951636] 16754614 pages RAM
Feb 27 01:07:59 eisenhower kernel: [2118254.951637] 0 pages HighMem/MovableOnly
Feb 27 01:07:59 eisenhower kernel: [2118254.951637] 314056 pages reserved
Feb 27 01:07:59 eisenhower kernel: [2118254.951638] 0 pages hwpoisoned
Feb 27 01:07:59 eisenhower kernel: [2118254.951639] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:07:59 eisenhower kernel: [2118254.961216] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:01 eisenhower kernel: [2118256.556803] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:01 eisenhower kernel: [2118256.566491] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:04 eisenhower kernel: [2118259.730379] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:04 eisenhower kernel: [2118259.796732] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:04 eisenhower kernel: [2118259.861700] CIFS: VFS: \\u1123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.058771] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.189119] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.319392] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.330267] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.340734] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.841646] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.854496] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.865957] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:05 eisenhower kernel: [2118260.877073] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:06 eisenhower kernel: [2118261.254423] CIFS: VFS: \\u123456.your-storagebox.de crypt_message: Failed to init sg
Feb 27 01:08:55 eisenhower kernel: [2118310.403609]  zd288: p1 p2
Feb 27 01:08:55 eisenhower kernel: [2118310.548838] fwbr108i0: port 2(tap108i0) entered disabled state
Feb 27 01:08:55 eisenhower kernel: [2118310.574155] fwbr108i0: port 1(fwln108i0) entered disabled state
Feb 27 01:08:55 eisenhower kernel: [2118310.574235] vmbr0: port 4(fwpr108p0) entered disabled state
Feb 27 01:08:55 eisenhower kernel: [2118310.574419] device fwln108i0 left promiscuous mode
Feb 27 01:08:55 eisenhower kernel: [2118310.574421] fwbr108i0: port 1(fwln108i0) entered disabled state
Feb 27 01:08:55 eisenhower kernel: [2118310.600233] device fwpr108p0 left promiscuous mode
Feb 27 01:08:55 eisenhower kernel: [2118310.600236] vmbr0: port 4(fwpr108p0) entered disabled state
Feb 27 01:08:55 eisenhower qmeventd[2039]: read: Connection reset by peer
Feb 27 01:08:55 eisenhower systemd[1]: 108.scope: Succeeded.
Feb 27 01:08:55 eisenhower systemd[1]: 108.scope: Consumed 34.897s CPU time.
Feb 27 01:08:56 eisenhower vzdump[4111356]: ERROR: Backup of VM 108 failed - zstd --rsyncable --threads=1 failed - wrong exit status 1
Feb 27 01:08:56 eisenhower qmeventd[324281]: Starting cleanup for 108
Feb 27 01:08:56 eisenhower qmeventd[324281]: Finished cleanup for 108

Google show that there are other people having this issue as well and some seems to have solved by switching over to use NFS. That's not an option in my case as Hetzner doesn't provide a NFS mount for their storageboxes.

Any ideas on how to proceed with this?
 
Extra info:

The disk related confiuration for the VM are:
Storage controller: VirtIO SCSI Single
Disk: SCSI, SSD Emulation: true, Discard: true, Cache mode: Write back and Async IO: Default (io_uring)

Output of pveversion -v
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-4-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-12
pve-kernel-5.13: 7.1-7
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-3
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-6
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-5
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
I've had the same issue. Something between Hetzner and Proxmox is misbehaving. This isn't something new, I remember having the same trouble years ago. In my case I solved it by using SSH-FS instead of SMB/CIFS, this works rock stable now.
 
flushing the backup archive requires memory, but you are running out of memory:

Code:
Feb 27 01:07:59 eisenhower kernel: [2118254.951312] kworker/u64:4: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0

64kB (2^4 pages = 16 * 4kB = 64kB) of continous memory requested, but:

Code:
kernel: [2118254.951606] Node 0 Normal free:195876kB min:64280kB low:126768kB high:189256kB reserved_highatomic:2048KB active_anon:16829040kB inactive_anon:4805592kB active_file:1604196kB inactive_file:5140968kB unevictable:151080kB writepending:341820kB present:63688192kB managed:62498132kB mlocked:151040kB bounce:0kB free_pcp:448kB local_pcp:0kB free_cma:0kB
Feb 27 01:07:59 eisenhower kernel: [2118254.951625] Node 0 Normal: 10883*4kB (UMH) 9711*8kB (UMEH) 4777*16kB (UE) 4*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 197780kB

largest continous chunk of free memory is 32kB. the combination of not much free and heavily fragmented memory can lead to such issues.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!