Backup leaves VMs unusable

Hi,

I set up two different daily backup jobs:
  • 00:30 Internal backup to a CephFS pool (SAS disks)
  • 01:30 External backup to a PBS instance
Every night, one or two of my Ubuntu 22.04 VMs (Kubernetes nodes) become unresponsive after the first backup (I think).
I find them hung and have to Stop and Start them, as the guest agent is not responding.
I think it is the first backup that breaks them, because the second one already fails on them.

The VMs all have the same virtual hardware.

I do see that the backup jobs, even though they are defined datacenter-wide, start in parallel on the three nodes. Could this be a bottleneck for the CephFS one? Should I schedule separate backup jobs for the different PVE nodes?


Here are two examples from today: two VMs on two different Proxmox nodes (the PVE nodes have exactly the same hardware), one Kubernetes control plane and one worker.
I have 7 other Kubernetes nodes of the same type on the cluster (9 in total), and every night some of them fail because of the backup jobs.

50051, Control Plane, Proxmox Node 1

VM Specs:

Code:
agent: 1
boot: order=scsi0;net0;ide0
cipassword: $5$ccOpeQpC$gMb8Dy/l6SqLtsH8mcXBwiDynUfMC9JyHpoWR4h9Jd9
ciuser: ubuntu
cores: 8
cpu: host
ide0: Ceph-NVMe2TB:vm-50051-cloudinit,media=cdrom,size=4M
ipconfig0: ip=10.0.50.51/24,gw=10.0.50.3
ipconfig1: ip=10.0.90.51/24
memory: 32768
meta: creation-qemu=7.1.0,ctime=1676235137
name: k8cp1
nameserver: 10.0.50.3
net0: virtio=BC:24:11:CC:CC:43,bridge=vmbr2,mtu=9000,tag=50
net1: virtio=BC:24:11:EE:1D:DF,bridge=vmbr1,mtu=9000
numa: 1
ostype: l26
scsi0: Ceph-NVMe2TB:vm-50051-disk-0,discard=on,iothread=1,size=100G
scsi1: Ceph-NVMe2TB:vm-50051-disk-1,backup=0,discard=on,iothread=1,size=500G
scsi2: Ceph-NVMe2TB:vm-50051-disk-2,backup=0,discard=on,iothread=1,size=500G
scsihw: virtio-scsi-single
searchdomain: urbaman.it
serial0: socket
smbios1: uuid=78d84437-f353-4169-b914-841dbb97b714
sockets: 2
startup: order=4,up=120
tags: 22.04;server;template;ubuntu
vmgenid: 258f8808-3cd9-4083-b322-a36b914cd0d4

CephFS backup log:

Code:
INFO: Starting Backup of VM 50051 (qemu)
INFO: Backup started at 2024-01-05 00:41:38
INFO: status = running
INFO: VM Name: k8cp1
INFO: include disk 'scsi0' 'Ceph-NVMe2TB:vm-50051-disk-0' 100G
INFO: exclude disk 'scsi1' 'Ceph-NVMe2TB:vm-50051-disk-1' (backup=no)
INFO: exclude disk 'scsi2' 'Ceph-NVMe2TB:vm-50051-disk-2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 5
INFO: creating vzdump archive '/mnt/pve/Cephfs_SAS900GB/dump/vzdump-qemu-50051-2024_01_05-00_41_38.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '0a79acb0-7f13-461b-a1a3-488a234dd3f5'
INFO: resuming VM again
INFO:   2% (2.0 GiB of 100.0 GiB) in 3s, read: 697.2 MiB/s, write: 155.7 MiB/s
INFO:   3% (3.2 GiB of 100.0 GiB) in 7s, read: 302.7 MiB/s, write: 149.4 MiB/s
INFO:   4% (4.2 GiB of 100.0 GiB) in 10s, read: 348.0 MiB/s, write: 109.6 MiB/s
INFO:   5% (5.0 GiB of 100.0 GiB) in 15s, read: 161.7 MiB/s, write: 103.4 MiB/s
INFO:   6% (6.5 GiB of 100.0 GiB) in 18s, read: 500.2 MiB/s, write: 142.3 MiB/s
INFO:   7% (7.0 GiB of 100.0 GiB) in 21s, read: 176.4 MiB/s, write: 143.2 MiB/s
INFO:   8% (8.1 GiB of 100.0 GiB) in 26s, read: 226.2 MiB/s, write: 157.7 MiB/s
INFO:   9% (9.1 GiB of 100.0 GiB) in 33s, read: 148.1 MiB/s, write: 115.1 MiB/s
INFO:  10% (10.1 GiB of 100.0 GiB) in 36s, read: 315.4 MiB/s, write: 114.8 MiB/s
INFO:  11% (11.3 GiB of 100.0 GiB) in 39s, read: 430.8 MiB/s, write: 126.5 MiB/s
INFO:  13% (13.1 GiB of 100.0 GiB) in 42s, read: 612.0 MiB/s, write: 155.6 MiB/s
INFO:  14% (14.0 GiB of 100.0 GiB) in 47s, read: 182.0 MiB/s, write: 125.0 MiB/s
INFO:  15% (15.1 GiB of 100.0 GiB) in 54s, read: 157.9 MiB/s, write: 126.3 MiB/s
INFO:  16% (16.0 GiB of 100.0 GiB) in 1m 1s, read: 140.9 MiB/s, write: 116.8 MiB/s
INFO:  17% (17.1 GiB of 100.0 GiB) in 1m 5s, read: 280.9 MiB/s, write: 114.6 MiB/s
INFO:  18% (18.0 GiB of 100.0 GiB) in 1m 11s, read: 148.6 MiB/s, write: 115.6 MiB/s
INFO:  19% (19.3 GiB of 100.0 GiB) in 1m 18s, read: 181.2 MiB/s, write: 168.3 MiB/s
INFO:  20% (20.5 GiB of 100.0 GiB) in 1m 22s, read: 331.1 MiB/s, write: 143.7 MiB/s
INFO:  34% (34.9 GiB of 100.0 GiB) in 1m 25s, read: 4.8 GiB/s, write: 81.2 MiB/s
INFO:  51% (51.2 GiB of 100.0 GiB) in 1m 28s, read: 5.4 GiB/s, write: 137.3 MiB/s
INFO:  82% (82.5 GiB of 100.0 GiB) in 1m 31s, read: 10.5 GiB/s, write: 40.0 KiB/s
INFO: 100% (100.0 GiB of 100.0 GiB) in 1m 33s, read: 8.7 GiB/s, write: 52.0 KiB/s
INFO: backup is sparse: 88.83 GiB (88%) total zero data
INFO: transferred 100.00 GiB in 93 seconds (1.1 GiB/s)
INFO: archive file size: 5.37GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-daily=2, keep-weekly=1
INFO: removing backup 'Cephfs_SAS900GB:backup/vzdump-qemu-50051-2023_12_24-00_40_52.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 50051 (00:01:35)
INFO: Backup finished at 2024-01-05 00:43:13

PBS backup log:

Code:
INFO: Starting Backup of VM 50051 (qemu)
INFO: Backup started at 2024-01-05 01:32:46
INFO: status = running
INFO: VM Name: k8cp1
INFO: include disk 'scsi0' 'Ceph-NVMe2TB:vm-50051-disk-0' 100G
INFO: exclude disk 'scsi1' 'Ceph-NVMe2TB:vm-50051-disk-1' (backup=no)
INFO: exclude disk 'scsi2' 'Ceph-NVMe2TB:vm-50051-disk-2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 5
INFO: creating Proxmox Backup Server archive 'vm/50051/2024-01-05T00:32:46Z'
INFO: issuing guest-agent 'fs-freeze' command
ERROR: VM 50051 qmp command 'guest-fsfreeze-freeze' failed - got timeout
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 50051 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '6be1e7a6-d8f3-46a4-8695-cb7317069b20'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (2.4 GiB of 100.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 2.4 GiB dirty of 100.0 GiB total
INFO: 100% (2.4 GiB of 2.4 GiB) in 0s, read: 2.4 GiB/s, write: 1.9 GiB/s
INFO: backup was done incrementally, reused 98.14 GiB (98%)
INFO: transferred 2.39 GiB in <1 seconds
INFO: adding notes to backup
INFO: Finished Backup of VM 50051 (01:03:01)
INFO: Backup finished at 2024-01-05 02:35:47

50058, Worker, Proxmox Node 2

VM Specs:

Code:
agent: 1
boot: order=scsi0;net0;ide0
cipassword: $5$ccOpeQpC$gMb8Dy/l6SqLtsH8mcXBwiDynUfMC9JyHpoWR4h9Jd9
ciuser: ubuntu
cores: 8
cpu: host
ide0: Ceph-NVMe2TB:vm-50058-cloudinit,media=cdrom,size=4M
ipconfig0: ip=10.0.50.58/24,gw=10.0.50.3
ipconfig1: ip=10.0.90.58/24
memory: 32768
meta: creation-qemu=7.1.0,ctime=1676235137
name: k8w5
nameserver: 10.0.50.3
net0: virtio=BC:24:11:78:E2:72,bridge=vmbr2,mtu=9000,tag=50
net1: virtio=BC:24:11:23:B1:58,bridge=vmbr1,mtu=9000
numa: 1
ostype: l26
scsi0: Ceph-NVMe2TB:vm-50058-disk-0,discard=on,iothread=1,size=100G
scsi1: Ceph-NVMe2TB:vm-50058-disk-1,backup=0,discard=on,iothread=1,size=500G
scsi2: Ceph-NVMe2TB:vm-50058-disk-2,backup=0,discard=on,iothread=1,size=500G
scsihw: virtio-scsi-single
searchdomain: urbaman.it
serial0: socket
smbios1: uuid=aa4ac842-8852-4380-afb4-eca0ee18d9c1
sockets: 2
startup: order=4,up=120
tags: 22.04;server;template;ubuntu
vmgenid: 0bfcc4cd-d0e9-4db5-a681-20ba22f04941

CephFS backup log:

Code:
INFO: Starting Backup of VM 50058 (qemu)
INFO: Backup started at 2024-01-05 00:45:49
INFO: status = running
INFO: VM Name: k8w5
INFO: include disk 'scsi0' 'Ceph-NVMe2TB:vm-50058-disk-0' 100G
INFO: exclude disk 'scsi1' 'Ceph-NVMe2TB:vm-50058-disk-1' (backup=no)
INFO: exclude disk 'scsi2' 'Ceph-NVMe2TB:vm-50058-disk-2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 5
INFO: creating vzdump archive '/mnt/pve/Cephfs_SAS900GB/dump/vzdump-qemu-50058-2024_01_05-00_45_49.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '5f37eb88-8502-4d72-b63f-baed195b7410'
INFO: resuming VM again
INFO:   1% (2.0 GiB of 100.0 GiB) in 3s, read: 670.6 MiB/s, write: 142.1 MiB/s
INFO:   2% (2.4 GiB of 100.0 GiB) in 6s, read: 150.3 MiB/s, write: 121.1 MiB/s
INFO:   3% (4.0 GiB of 100.0 GiB) in 9s, read: 540.9 MiB/s, write: 177.3 MiB/s
INFO:   4% (4.5 GiB of 100.0 GiB) in 12s, read: 158.3 MiB/s, write: 88.0 MiB/s
INFO:   5% (5.1 GiB of 100.0 GiB) in 18s, read: 111.8 MiB/s, write: 88.1 MiB/s
INFO:   6% (6.0 GiB of 100.0 GiB) in 24s, read: 153.5 MiB/s, write: 117.4 MiB/s
INFO:   7% (7.0 GiB of 100.0 GiB) in 31s, read: 147.1 MiB/s, write: 126.6 MiB/s
INFO:   8% (8.0 GiB of 100.0 GiB) in 36s, read: 207.3 MiB/s, write: 164.8 MiB/s
INFO:   9% (9.0 GiB of 100.0 GiB) in 44s, read: 130.0 MiB/s, write: 120.2 MiB/s
INFO:  10% (10.0 GiB of 100.0 GiB) in 50s, read: 167.4 MiB/s, write: 150.3 MiB/s
INFO:  11% (11.1 GiB of 100.0 GiB) in 57s, read: 153.8 MiB/s, write: 124.8 MiB/s
INFO:  12% (12.1 GiB of 100.0 GiB) in 1m 4s, read: 148.6 MiB/s, write: 134.8 MiB/s
INFO:  13% (13.1 GiB of 100.0 GiB) in 1m 9s, read: 210.8 MiB/s, write: 92.9 MiB/s
INFO:  14% (14.1 GiB of 100.0 GiB) in 1m 15s, read: 159.7 MiB/s, write: 114.6 MiB/s
INFO:  15% (15.2 GiB of 100.0 GiB) in 1m 20s, read: 229.8 MiB/s, write: 102.0 MiB/s
INFO:  16% (16.1 GiB of 100.0 GiB) in 1m 26s, read: 153.5 MiB/s, write: 122.7 MiB/s
INFO:  17% (17.2 GiB of 100.0 GiB) in 1m 36s, read: 113.1 MiB/s, write: 104.7 MiB/s
INFO:  18% (18.0 GiB of 100.0 GiB) in 1m 42s, read: 143.2 MiB/s, write: 117.4 MiB/s
INFO:  19% (19.5 GiB of 100.0 GiB) in 1m 45s, read: 506.3 MiB/s, write: 229.8 MiB/s
INFO:  20% (20.1 GiB of 100.0 GiB) in 1m 48s, read: 207.8 MiB/s, write: 98.1 MiB/s
INFO:  21% (21.9 GiB of 100.0 GiB) in 1m 52s, read: 457.1 MiB/s, write: 157.0 MiB/s
INFO:  22% (22.3 GiB of 100.0 GiB) in 1m 55s, read: 133.6 MiB/s, write: 131.8 MiB/s
INFO:  23% (23.9 GiB of 100.0 GiB) in 2m 1s, read: 274.3 MiB/s, write: 109.2 MiB/s
INFO:  24% (24.2 GiB of 100.0 GiB) in 2m 4s, read: 94.2 MiB/s, write: 93.8 MiB/s
INFO:  29% (29.7 GiB of 100.0 GiB) in 2m 9s, read: 1.1 GiB/s, write: 78.3 MiB/s
INFO:  30% (30.1 GiB of 100.0 GiB) in 2m 12s, read: 126.9 MiB/s, write: 83.2 MiB/s
INFO:  31% (31.7 GiB of 100.0 GiB) in 2m 16s, read: 427.8 MiB/s, write: 99.5 MiB/s
INFO:  33% (34.0 GiB of 100.0 GiB) in 2m 19s, read: 770.9 MiB/s, write: 237.9 MiB/s
INFO:  46% (46.3 GiB of 100.0 GiB) in 2m 22s, read: 4.1 GiB/s, write: 56.7 MiB/s
INFO:  49% (50.0 GiB of 100.0 GiB) in 2m 25s, read: 1.2 GiB/s, write: 81.6 MiB/s
INFO:  50% (50.8 GiB of 100.0 GiB) in 2m 28s, read: 274.3 MiB/s, write: 121.5 MiB/s
INFO:  51% (51.1 GiB of 100.0 GiB) in 2m 31s, read: 100.3 MiB/s, write: 98.6 MiB/s
INFO:  58% (58.3 GiB of 100.0 GiB) in 2m 37s, read: 1.2 GiB/s, write: 94.7 MiB/s
INFO:  86% (86.6 GiB of 100.0 GiB) in 2m 40s, read: 9.5 GiB/s, write: 7.7 MiB/s
INFO: 100% (100.0 GiB of 100.0 GiB) in 2m 43s, read: 4.5 GiB/s, write: 32.0 KiB/s
INFO: backup is sparse: 81.72 GiB (81%) total zero data
INFO: transferred 100.00 GiB in 163 seconds (628.2 MiB/s)
INFO: archive file size: 9.29GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-daily=2, keep-weekly=1
INFO: removing backup 'Cephfs_SAS900GB:backup/vzdump-qemu-50058-2023_12_24-00_57_47.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 50058 (00:04:28)

PBS backup log:

Code:
INFO: Starting Backup of VM 50058 (qemu)
INFO: Backup started at 2024-01-05 01:32:52
INFO: status = running
INFO: VM Name: k8w5
INFO: include disk 'scsi0' 'Ceph-NVMe2TB:vm-50058-disk-0' 100G
INFO: exclude disk 'scsi1' 'Ceph-NVMe2TB:vm-50058-disk-1' (backup=no)
INFO: exclude disk 'scsi2' 'Ceph-NVMe2TB:vm-50058-disk-2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 5
INFO: creating Proxmox Backup Server archive 'vm/50058/2024-01-05T00:32:52Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 50058 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'b3ee5409-2b2d-4517-9160-18865933b5e7'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (4.3 GiB of 100.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 4.3 GiB dirty of 100.0 GiB total
INFO: 100% (4.3 GiB of 4.3 GiB) in 0s, read: 4.3 GiB/s, write: 4.2 GiB/s
INFO: backup was done incrementally, reused 95.84 GiB (95%)
INFO: transferred 4.29 GiB in <1 seconds
INFO: adding notes to backup
INFO: Finished Backup of VM 50058 (00:03:01)
 
Hi,
"I do see that the backup jobs, even though they are defined datacenter-wide, start in parallel on the three nodes. Could this be a bottleneck for the CephFS one? Should I schedule separate backup jobs for the different PVE nodes?"
Yes, that helped other users with similar issues.
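
As a rough sketch of what splitting the job per node could look like, the snippet below only prints one `pvesh create /cluster/backup` command per node, each starting 30 minutes after the previous one, so nothing is changed until you review and run the output yourself. The node names `pve1` to `pve3` are placeholders, the storage ID is the one from your log, and the option names (`--node`, `--schedule`, `--storage`, `--all`, `--mode`) are my reading of the backup job API, so please check them with `pvesh usage /cluster/backup --verbose` on your version first.

```shell
#!/bin/sh
# Print (do not run) one backup-job command per node with staggered start times.
# Arguments: storage ID, start hour, start minute, step in minutes, node names...
# Pass hour and minute without leading zeros (for example 0 30, not 00 30).
stagger_jobs() {
    storage=$1; start_hour=$2; start_min=$3; step_min=$4
    shift 4
    total=$((start_hour * 60 + start_min))
    for node in "$@"; do
        # Convert minutes since midnight back to HH:MM, wrapping past 24:00.
        printf 'pvesh create /cluster/backup --node %s --schedule %02d:%02d --storage %s --all 1 --mode snapshot\n' \
            "$node" $((total / 60 % 24)) $((total % 60)) "$storage"
        total=$((total + step_min))
    done
}

# Placeholder node names; replace them with your real PVE node names.
stagger_jobs Cephfs_SAS900GB 0 30 30 pve1 pve2 pve3
```

With these values the third node would start at 01:30, the same time as the PBS job, so you would probably want to move one of the two.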

If the issue still happens, it might be a rare issue related to iothread in QEMU 8.1 versions. A fix is still being worked on. You could try turning off the iothread setting on your VM's drives and see if that helps.
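
To try that without clicking through every drive, something like the following could help. It is only a sketch: `drop_iothread` is a made-up helper that takes one drive line as shown by `qm config <vmid>` and prints the matching `qm set` command with the iothread option removed, so you can review it before running it. As far as I know, the change only takes effect after the VM has been fully stopped and started again, not after a reboot from inside the guest.

```shell
#!/bin/sh
# Turn one drive line from `qm config <vmid>` into a `qm set` command
# that defines the same drive without the iothread option.
# The command is only printed, never executed.
drop_iothread() {
    vmid=$1
    line=$2
    key=${line%%:*}      # text before the first colon, e.g. "scsi0"
    value=${line#*: }    # text after the first ": ", i.e. the drive definition
    value=$(printf '%s' "$value" | sed 's/,iothread=[01]//')
    printf 'qm set %s --%s %s\n' "$vmid" "$key" "$value"
}

# Example with the boot disk of VM 50051 from the config above.
drop_iothread 50051 'scsi0: Ceph-NVMe2TB:vm-50051-disk-0,discard=on,iothread=1,size=100G'
```

For the example line this prints `qm set 50051 --scsi0 Ceph-NVMe2TB:vm-50051-disk-0,discard=on,size=100G`.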
 
1.
Code:
ERROR: VM 50051 qmp command 'guest-fsfreeze-freeze' failed - got timeout


Check that qemu-guest-agent is running inside VM 50051:

systemctl status qemu-guest-agent.service


2. Try disabling iothread in the drive settings. PBS backs up only the diff from the previous backup, but iothread is a problem for snapshot backups and diffs.
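
Since this hits different VMs every night, it may help to list which ones ran into the guest-agent timeout. The sketch below is only an illustration: `fsfreeze_timeouts` is a made-up helper that pulls the VM IDs out of log lines like the one quoted in point 1. It reads the files given as arguments, or standard input if there are none. I am assuming the per-VM logs are under `/var/log/vzdump/` on each node; adjust the path if yours differ. From the host, `qm agent 50051 ping` should also tell you whether the agent answers at all.

```shell
#!/bin/sh
# Print the IDs of VMs whose backup log contains a guest-agent
# fs-freeze or fs-thaw timeout, one ID per line, without duplicates.
fsfreeze_timeouts() {
    sed -n "s/.*ERROR: VM \([0-9][0-9]*\) qmp command 'guest-fsfreeze-[a-z]*' failed - got timeout.*/\1/p" "$@" | sort -u
}

# Demo with lines copied from the PBS logs in this thread.
fsfreeze_timeouts <<'EOF'
INFO: issuing guest-agent 'fs-freeze' command
ERROR: VM 50051 qmp command 'guest-fsfreeze-freeze' failed - got timeout
ERROR: VM 50051 qmp command 'guest-fsfreeze-thaw' failed - got timeout
ERROR: VM 50058 qmp command 'guest-fsfreeze-thaw' failed - got timeout
EOF
```

On a node you would call it on real files instead, for example `fsfreeze_timeouts /var/log/vzdump/qemu-*.log`.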
 