Hey there,
I encountered a strange problem the other day. First, let me describe our setup: we use Proxmox for VM and container hosting. Up until now, we used local storage (ZFS) in each of our hypervisors. As you know, local-only storage has many drawbacks, and we finally wanted HA. Since another group at my workplace offers Ceph as a service, we opted for that. Each of our hypervisors is connected to the Ceph network with 2x10 GBit/s.
I did lots of testing, and it worked great. Last week I migrated parts of our production system to the Ceph cluster. No problems - until the backup tasks started (at 3:00). The backups are stored via NFS on a FreeNAS box (connected with 10 GBit/s); we never had issues with these backups before.
The first night our monitoring system alerted like crazy (mostly high disk latency, high I/O wait, and some web services not answering in time). I checked the whole setup again and got in contact with the Ceph administrators. We tried to find out what the problem was, but Ceph-wise everything looked OK. The Ceph links were barely used (peak at ~4 GBit/s).
The next night, 4 out of 7 production systems deployed on the Ceph cluster crashed - all with a kernel panic (see screenshot). According to our monitoring system, the first VM (ID 903) went down at ~4:20. The backup log shows that the backup task for that exact VM (903) was only started at 8:27 in the morning, so from my understanding that shouldn't have been a problem. The VM (ID 801) whose backup task was running at 4:20 took over 2.5 h to complete (and it also crashed during the backup, at ~2:35).
Backup log from VM 801:
Code:
INFO: Starting Backup of VM 801 (qemu)
INFO: Backup started at 2021-02-09 03:54:27
INFO: status = running
INFO: VM Name: XXXXXXXXXXXX
INFO: include disk 'scsi0' 'rbd:vm-801-disk-0' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating vzdump archive '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-801-2021_02_09-03_54_27.vma.zst'
INFO: started backup task 'a3cda6f7-e806-433b-9ce5-8985c389532a'
INFO: resuming VM again
INFO: 0% (112.0 MiB of 300.0 GiB) in 3s, read: 37.3 MiB/s, write: 25.9 MiB/s
INFO: 1% (3.0 GiB of 300.0 GiB) in 2m 5s, read: 24.3 MiB/s, write: 21.8 MiB/s
INFO: 2% (6.0 GiB of 300.0 GiB) in 4m 22s, read: 22.4 MiB/s, write: 3.9 MiB/s
INFO: 3% (9.0 GiB of 300.0 GiB) in 6m 23s, read: 25.5 MiB/s, write: 1.8 MiB/s
INFO: 4% (12.0 GiB of 300.0 GiB) in 9m 3s, read: 19.1 MiB/s, write: 14.1 MiB/s
INFO: 5% (15.0 GiB of 300.0 GiB) in 11m 35s, read: 20.4 MiB/s, write: 19.9 MiB/s
INFO: 6% (18.0 GiB of 300.0 GiB) in 14m 21s, read: 18.4 MiB/s, write: 18.1 MiB/s
INFO: 7% (21.0 GiB of 300.0 GiB) in 17m 12s, read: 18.0 MiB/s, write: 17.6 MiB/s
INFO: 8% (24.0 GiB of 300.0 GiB) in 20m 12s, read: 17.1 MiB/s, write: 17.0 MiB/s
INFO: 9% (27.0 GiB of 300.0 GiB) in 23m 32s, read: 15.3 MiB/s, write: 14.9 MiB/s
INFO: 10% (30.0 GiB of 300.0 GiB) in 27m 22s, read: 13.3 MiB/s, write: 13.2 MiB/s
INFO: 11% (33.0 GiB of 300.0 GiB) in 30m 47s, read: 15.1 MiB/s, write: 14.7 MiB/s
INFO: 12% (36.0 GiB of 300.0 GiB) in 33m 51s, read: 16.6 MiB/s, write: 16.4 MiB/s
INFO: 13% (39.0 GiB of 300.0 GiB) in 37m 22s, read: 14.7 MiB/s, write: 14.3 MiB/s
INFO: 14% (42.0 GiB of 300.0 GiB) in 40m 8s, read: 18.6 MiB/s, write: 18.4 MiB/s
INFO: 15% (45.0 GiB of 300.0 GiB) in 42m 54s, read: 18.4 MiB/s, write: 18.0 MiB/s
INFO: 16% (48.1 GiB of 300.0 GiB) in 45m 15s, read: 21.9 MiB/s, write: 21.7 MiB/s
INFO: 17% (51.0 GiB of 300.0 GiB) in 47m 47s, read: 19.9 MiB/s, write: 19.4 MiB/s
INFO: 18% (54.0 GiB of 300.0 GiB) in 50m 42s, read: 17.6 MiB/s, write: 17.4 MiB/s
INFO: 19% (57.0 GiB of 300.0 GiB) in 53m 44s, read: 16.9 MiB/s, write: 16.5 MiB/s
INFO: 20% (60.0 GiB of 300.0 GiB) in 56m 26s, read: 18.9 MiB/s, write: 18.7 MiB/s
INFO: 21% (63.0 GiB of 300.0 GiB) in 58m 57s, read: 20.5 MiB/s, write: 20.1 MiB/s
INFO: 22% (66.0 GiB of 300.0 GiB) in 1h 1m 29s, read: 20.0 MiB/s, write: 19.9 MiB/s
INFO: 23% (69.0 GiB of 300.0 GiB) in 1h 3m 5s, read: 32.3 MiB/s, write: 31.2 MiB/s
...
INFO: 89% (267.1 GiB of 300.0 GiB) in 2h 34m 28s, read: 74.3 MiB/s, write: 72.6 MiB/s
INFO: 90% (270.1 GiB of 300.0 GiB) in 2h 35m 12s, read: 70.7 MiB/s, write: 69.2 MiB/s
INFO: 91% (273.1 GiB of 300.0 GiB) in 2h 35m 53s, read: 74.3 MiB/s, write: 73.6 MiB/s
INFO: 92% (276.1 GiB of 300.0 GiB) in 2h 36m 34s, read: 75.1 MiB/s, write: 73.4 MiB/s
INFO: 93% (279.0 GiB of 300.0 GiB) in 2h 37m 16s, read: 71.5 MiB/s, write: 69.7 MiB/s
INFO: 94% (282.0 GiB of 300.0 GiB) in 2h 38m 14s, read: 53.6 MiB/s, write: 52.9 MiB/s
INFO: 95% (285.0 GiB of 300.0 GiB) in 2h 38m 55s, read: 74.0 MiB/s, write: 72.6 MiB/s
INFO: 96% (288.0 GiB of 300.0 GiB) in 2h 39m 35s, read: 77.0 MiB/s, write: 76.2 MiB/s
INFO: 97% (291.1 GiB of 300.0 GiB) in 2h 40m 8s, read: 95.4 MiB/s, write: 93.6 MiB/s
INFO: 98% (294.1 GiB of 300.0 GiB) in 2h 40m 49s, read: 74.0 MiB/s, write: 72.5 MiB/s
INFO: 99% (297.1 GiB of 300.0 GiB) in 2h 41m 29s, read: 78.1 MiB/s, write: 77.3 MiB/s
INFO: 100% (300.0 GiB of 300.0 GiB) in 2h 42m 3s, read: 87.2 MiB/s, write: 86.2 MiB/s
INFO: backup is sparse: 11.07 GiB (3%) total zero data
INFO: transferred 300.00 GiB in 9723 seconds (31.6 MiB/s)
INFO: archive file size: 66.25GB
INFO: delete old backup '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-801-2021_02_02-04_21_11.vma.zst'
INFO: Finished Backup of VM 801 (02:42:06)
Backup log of VM 903:
Code:
INFO: Starting Backup of VM 903 (qemu)
INFO: Backup started at 2021-02-09 08:27:21
INFO: status = running
INFO: VM Name: XXXXXXXXXX
INFO: include disk 'scsi0' 'rbd:vm-903-disk-0' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-903-2021_02_09-08_27_21.vma.zst'
INFO: started backup task 'c8adae24-1a23-4677-bc9b-98c05a479445'
INFO: resuming VM again
INFO: 0% (388.0 MiB of 100.0 GiB) in 3s, read: 129.3 MiB/s, write: 83.2 MiB/s
INFO: 1% (1.0 GiB of 100.0 GiB) in 9s, read: 110.7 MiB/s, write: 104.5 MiB/s
INFO: 2% (2.1 GiB of 100.0 GiB) in 16s, read: 153.2 MiB/s, write: 75.2 MiB/s
INFO: 3% (3.1 GiB of 100.0 GiB) in 25s, read: 113.8 MiB/s, write: 90.7 MiB/s
INFO: 4% (4.1 GiB of 100.0 GiB) in 34s, read: 115.0 MiB/s, write: 99.4 MiB/s
INFO: 5% (5.1 GiB of 100.0 GiB) in 45s, read: 95.6 MiB/s, write: 88.7 MiB/s
INFO: 6% (6.0 GiB of 100.0 GiB) in 54s, read: 101.9 MiB/s, write: 99.5 MiB/s
INFO: 7% (7.1 GiB of 100.0 GiB) in 1m 5s, read: 97.5 MiB/s, write: 94.3 MiB/s
INFO: 8% (8.0 GiB of 100.0 GiB) in 1m 15s, read: 97.9 MiB/s, write: 91.3 MiB/s
....
INFO: 90% (90.2 GiB of 100.0 GiB) in 9m 21s, read: 214.4 MiB/s, write: 22.9 MiB/s
INFO: 91% (91.1 GiB of 100.0 GiB) in 9m 25s, read: 246.0 MiB/s, write: 3.9 MiB/s
INFO: 92% (92.0 GiB of 100.0 GiB) in 9m 29s, read: 230.0 MiB/s, write: 12.9 MiB/s
INFO: 93% (93.0 GiB of 100.0 GiB) in 9m 33s, read: 254.0 MiB/s, write: 240.0 KiB/s
INFO: 94% (94.2 GiB of 100.0 GiB) in 9m 38s, read: 242.4 MiB/s, write: 11.4 MiB/s
INFO: 95% (95.2 GiB of 100.0 GiB) in 9m 42s, read: 258.0 MiB/s, write: 0 B/s
INFO: 96% (96.2 GiB of 100.0 GiB) in 9m 46s, read: 256.0 MiB/s, write: 0 B/s
INFO: 97% (97.2 GiB of 100.0 GiB) in 9m 50s, read: 257.0 MiB/s, write: 0 B/s
INFO: 98% (98.2 GiB of 100.0 GiB) in 9m 54s, read: 255.0 MiB/s, write: 0 B/s
INFO: 99% (99.2 GiB of 100.0 GiB) in 9m 58s, read: 247.0 MiB/s, write: 0 B/s
INFO: 100% (100.0 GiB of 100.0 GiB) in 10m 2s, read: 215.0 MiB/s, write: 2.0 KiB/s
INFO: backup is sparse: 69.31 GiB (69%) total zero data
INFO: transferred 100.00 GiB in 602 seconds (170.1 MiB/s)
INFO: archive file size: 11.67GB
INFO: delete old backup '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-903-2021_02_02-08_33_17.vma.zst'
INFO: Finished Backup of VM 903 (00:10:02)
INFO: Backup finished at 2021-02-09 08:37:23
As I said, none of the other users of the Ceph cluster had any problems during that time.
I found a similar issue here: https://forum.proxmox.com/threads/vm-hangs-during-backup.62781/ - there, the link to the storage server was too slow. Since all the other backups (from local storage) complete without any problems, I don't think that this was the issue here.
Here is the pveversion output:
Code:
pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
And an RBD bench from the corresponding hypervisor:
Code:
/e/ceph# rbd bench -n XXX --io-type rw --io-pattern rand --io-size=4K --io-total=1G XXX
bench type readwrite read:write=50:50 io_size 4096 io_threads 16 bytes 1073741824 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 11008 11024.01 45154344.14
....
elapsed: 72 ops: 262144 ops/sec: 3591.61 bytes/sec: 14711223.46
read_ops: 131207 read_ops/sec: 1797.65 read_bytes/sec: 7363187.78
write_ops: 130937 write_ops/sec: 1793.95 write_bytes/sec: 7348035.69
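For comparison, I could also run a pure write bench against the same image, to see whether write latency alone already looks bad (same placeholders as above; I haven't run this one yet):
Code:
rbd bench -n XXX --io-type write --io-pattern rand --io-size 4K --io-total 1G XXX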
What can I do to prevent the Proxmox backup from crashing my VMs?
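The only idea I have so far is to throttle the backup itself, e.g. with a node-wide bandwidth limit for vzdump (value in KiB/s, the number below is just an example), but that feels like treating the symptom rather than the cause:
Code:
# /etc/vzdump.conf - node-wide vzdump defaults (example values)
bwlimit: 51200   # cap backup read rate at ~50 MiB/s
ionice: 7        # the priority already shown in the logs above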
BTW: why doesn't Proxmox simply create a storage snapshot and transfer its contents without locking the VM? Apparently it uses a QEMU feature that "might slow down the VM" (https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt).
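To illustrate what I mean, a manual sketch on the Ceph side would be something like this (pool/image names are just placeholders based on the log above, and I know this produces a plain raw image instead of a VMA archive):
Code:
# take a point-in-time snapshot of the disk on the Ceph side
rbd snap create rbd/vm-801-disk-0@vzdump
# stream the snapshot to the NFS share without touching the running VM
rbd export rbd/vm-801-disk-0@vzdump - | zstd > /mnt/pve/XXXXXXXXXX/dump/vm-801-disk-0.raw.zst
# remove the snapshot again
rbd snap rm rbd/vm-801-disk-0@vzdump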
Thanks in advance!
Cheers
Matze