Guest VMs crash during Backup

mtze

Hey there,
I encountered a strange problem the other day. First, let me describe our setup. We use Proxmox for VM and container hosting. Up until now we used local storage (ZFS) in each of our hypervisors. As you know, local-only storage has many drawbacks, and we finally wanted to have HA. Since another group at my workplace offers Ceph as a service, we opted for that. Each of our hypervisors is connected to the Ceph network with 2x10GBit/s.
I did lots of testing, and it worked great. Last week I migrated parts of our production system to the Ceph cluster. No problems - until the backup tasks started (3:00). The backups are stored via NFS on a FreeNAS box (connected with 10GBit/s); we never had any issues with the backups before.
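For context, the relevant storage definitions look roughly like this - a sketch only; the storage IDs, pool name, monitor addresses, and export path below are placeholders, not our real config:

Code:
# /etc/pve/storage.cfg (sketch - all names and addresses are placeholders)
rbd: ceph-vm
        pool vm-pool
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        content images
        krbd 0

nfs: freenas-backup
        server 10.0.1.10
        export /mnt/tank/pve-backup
        path /mnt/pve/freenas-backup
        content backup
        options vers=3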

The first night our monitoring system alerted like crazy (mostly high disk latency, high I/O wait, and some web services not answering in time). I checked the whole setup again and got in contact with the Ceph administrators. We tried to find out what the problem was, but Ceph-wise everything looked fine. The Ceph links are barely used (peak at ~4GBit/s).

The next night, 4 out of 7 production systems deployed on the Ceph cluster crashed - all with a kernel panic (see screenshot). According to our monitoring system, the first VM (ID 903) went down at ~4:20. The backup log shows that the backup task for that exact VM (903) was only started at 8:27 in the morning - so from my understanding, that shouldn't have been a problem. The VM (ID 801) whose backup task was running at 4:20 took over 2.5 h to complete (and it also crashed during the backup, around the 2 h 35 m mark).

Backup log from VM 801:

Code:
INFO: Starting Backup of VM 801 (qemu)
INFO: Backup started at 2021-02-09 03:54:27
INFO: status = running
INFO: VM Name: XXXXXXXXXXXX
INFO: include disk 'scsi0' 'rbd:vm-801-disk-0' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating vzdump archive '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-801-2021_02_09-03_54_27.vma.zst'
INFO: started backup task 'a3cda6f7-e806-433b-9ce5-8985c389532a'
INFO: resuming VM again
INFO:   0% (112.0 MiB of 300.0 GiB) in  3s, read: 37.3 MiB/s, write: 25.9 MiB/s
INFO:   1% (3.0 GiB of 300.0 GiB) in  2m  5s, read: 24.3 MiB/s, write: 21.8 MiB/s
INFO:   2% (6.0 GiB of 300.0 GiB) in  4m 22s, read: 22.4 MiB/s, write: 3.9 MiB/s
INFO:   3% (9.0 GiB of 300.0 GiB) in  6m 23s, read: 25.5 MiB/s, write: 1.8 MiB/s
INFO:   4% (12.0 GiB of 300.0 GiB) in  9m  3s, read: 19.1 MiB/s, write: 14.1 MiB/s
INFO:   5% (15.0 GiB of 300.0 GiB) in 11m 35s, read: 20.4 MiB/s, write: 19.9 MiB/s
INFO:   6% (18.0 GiB of 300.0 GiB) in 14m 21s, read: 18.4 MiB/s, write: 18.1 MiB/s
INFO:   7% (21.0 GiB of 300.0 GiB) in 17m 12s, read: 18.0 MiB/s, write: 17.6 MiB/s
INFO:   8% (24.0 GiB of 300.0 GiB) in 20m 12s, read: 17.1 MiB/s, write: 17.0 MiB/s
INFO:   9% (27.0 GiB of 300.0 GiB) in 23m 32s, read: 15.3 MiB/s, write: 14.9 MiB/s
INFO:  10% (30.0 GiB of 300.0 GiB) in 27m 22s, read: 13.3 MiB/s, write: 13.2 MiB/s
INFO:  11% (33.0 GiB of 300.0 GiB) in 30m 47s, read: 15.1 MiB/s, write: 14.7 MiB/s
INFO:  12% (36.0 GiB of 300.0 GiB) in 33m 51s, read: 16.6 MiB/s, write: 16.4 MiB/s
INFO:  13% (39.0 GiB of 300.0 GiB) in 37m 22s, read: 14.7 MiB/s, write: 14.3 MiB/s
INFO:  14% (42.0 GiB of 300.0 GiB) in 40m  8s, read: 18.6 MiB/s, write: 18.4 MiB/s
INFO:  15% (45.0 GiB of 300.0 GiB) in 42m 54s, read: 18.4 MiB/s, write: 18.0 MiB/s
INFO:  16% (48.1 GiB of 300.0 GiB) in 45m 15s, read: 21.9 MiB/s, write: 21.7 MiB/s
INFO:  17% (51.0 GiB of 300.0 GiB) in 47m 47s, read: 19.9 MiB/s, write: 19.4 MiB/s
INFO:  18% (54.0 GiB of 300.0 GiB) in 50m 42s, read: 17.6 MiB/s, write: 17.4 MiB/s
INFO:  19% (57.0 GiB of 300.0 GiB) in 53m 44s, read: 16.9 MiB/s, write: 16.5 MiB/s
INFO:  20% (60.0 GiB of 300.0 GiB) in 56m 26s, read: 18.9 MiB/s, write: 18.7 MiB/s
INFO:  21% (63.0 GiB of 300.0 GiB) in 58m 57s, read: 20.5 MiB/s, write: 20.1 MiB/s
INFO:  22% (66.0 GiB of 300.0 GiB) in  1h  1m 29s, read: 20.0 MiB/s, write: 19.9 MiB/s
INFO:  23% (69.0 GiB of 300.0 GiB) in  1h  3m  5s, read: 32.3 MiB/s, write: 31.2 MiB/s
...
INFO:  89% (267.1 GiB of 300.0 GiB) in  2h 34m 28s, read: 74.3 MiB/s, write: 72.6 MiB/s
INFO:  90% (270.1 GiB of 300.0 GiB) in  2h 35m 12s, read: 70.7 MiB/s, write: 69.2 MiB/s
INFO:  91% (273.1 GiB of 300.0 GiB) in  2h 35m 53s, read: 74.3 MiB/s, write: 73.6 MiB/s
INFO:  92% (276.1 GiB of 300.0 GiB) in  2h 36m 34s, read: 75.1 MiB/s, write: 73.4 MiB/s
INFO:  93% (279.0 GiB of 300.0 GiB) in  2h 37m 16s, read: 71.5 MiB/s, write: 69.7 MiB/s
INFO:  94% (282.0 GiB of 300.0 GiB) in  2h 38m 14s, read: 53.6 MiB/s, write: 52.9 MiB/s
INFO:  95% (285.0 GiB of 300.0 GiB) in  2h 38m 55s, read: 74.0 MiB/s, write: 72.6 MiB/s
INFO:  96% (288.0 GiB of 300.0 GiB) in  2h 39m 35s, read: 77.0 MiB/s, write: 76.2 MiB/s
INFO:  97% (291.1 GiB of 300.0 GiB) in  2h 40m  8s, read: 95.4 MiB/s, write: 93.6 MiB/s
INFO:  98% (294.1 GiB of 300.0 GiB) in  2h 40m 49s, read: 74.0 MiB/s, write: 72.5 MiB/s
INFO:  99% (297.1 GiB of 300.0 GiB) in  2h 41m 29s, read: 78.1 MiB/s, write: 77.3 MiB/s
INFO: 100% (300.0 GiB of 300.0 GiB) in  2h 42m  3s, read: 87.2 MiB/s, write: 86.2 MiB/s
INFO: backup is sparse: 11.07 GiB (3%) total zero data
INFO: transferred 300.00 GiB in 9723 seconds (31.6 MiB/s)
INFO: archive file size: 66.25GB
INFO: delete old backup '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-801-2021_02_02-04_21_11.vma.zst'
INFO: Finished Backup of VM 801 (02:42:06)

Backup log from VM 903:

Code:
INFO: Starting Backup of VM 903 (qemu)
INFO: Backup started at 2021-02-09 08:27:21
INFO: status = running
INFO: VM Name: XXXXXXXXXX
INFO: include disk 'scsi0' 'rbd:vm-903-disk-0' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-903-2021_02_09-08_27_21.vma.zst'
INFO: started backup task 'c8adae24-1a23-4677-bc9b-98c05a479445'
INFO: resuming VM again
INFO:   0% (388.0 MiB of 100.0 GiB) in  3s, read: 129.3 MiB/s, write: 83.2 MiB/s
INFO:   1% (1.0 GiB of 100.0 GiB) in  9s, read: 110.7 MiB/s, write: 104.5 MiB/s
INFO:   2% (2.1 GiB of 100.0 GiB) in 16s, read: 153.2 MiB/s, write: 75.2 MiB/s
INFO:   3% (3.1 GiB of 100.0 GiB) in 25s, read: 113.8 MiB/s, write: 90.7 MiB/s
INFO:   4% (4.1 GiB of 100.0 GiB) in 34s, read: 115.0 MiB/s, write: 99.4 MiB/s
INFO:   5% (5.1 GiB of 100.0 GiB) in 45s, read: 95.6 MiB/s, write: 88.7 MiB/s
INFO:   6% (6.0 GiB of 100.0 GiB) in 54s, read: 101.9 MiB/s, write: 99.5 MiB/s
INFO:   7% (7.1 GiB of 100.0 GiB) in  1m  5s, read: 97.5 MiB/s, write: 94.3 MiB/s
INFO:   8% (8.0 GiB of 100.0 GiB) in  1m 15s, read: 97.9 MiB/s, write: 91.3 MiB/s
....
INFO:  90% (90.2 GiB of 100.0 GiB) in  9m 21s, read: 214.4 MiB/s, write: 22.9 MiB/s
INFO:  91% (91.1 GiB of 100.0 GiB) in  9m 25s, read: 246.0 MiB/s, write: 3.9 MiB/s
INFO:  92% (92.0 GiB of 100.0 GiB) in  9m 29s, read: 230.0 MiB/s, write: 12.9 MiB/s
INFO:  93% (93.0 GiB of 100.0 GiB) in  9m 33s, read: 254.0 MiB/s, write: 240.0 KiB/s
INFO:  94% (94.2 GiB of 100.0 GiB) in  9m 38s, read: 242.4 MiB/s, write: 11.4 MiB/s
INFO:  95% (95.2 GiB of 100.0 GiB) in  9m 42s, read: 258.0 MiB/s, write: 0 B/s
INFO:  96% (96.2 GiB of 100.0 GiB) in  9m 46s, read: 256.0 MiB/s, write: 0 B/s
INFO:  97% (97.2 GiB of 100.0 GiB) in  9m 50s, read: 257.0 MiB/s, write: 0 B/s
INFO:  98% (98.2 GiB of 100.0 GiB) in  9m 54s, read: 255.0 MiB/s, write: 0 B/s
INFO:  99% (99.2 GiB of 100.0 GiB) in  9m 58s, read: 247.0 MiB/s, write: 0 B/s
INFO: 100% (100.0 GiB of 100.0 GiB) in 10m  2s, read: 215.0 MiB/s, write: 2.0 KiB/s
INFO: backup is sparse: 69.31 GiB (69%) total zero data
INFO: transferred 100.00 GiB in 602 seconds (170.1 MiB/s)
INFO: archive file size: 11.67GB
INFO: delete old backup '/mnt/pve/XXXXXXXXXX/dump/vzdump-qemu-903-2021_02_02-08_33_17.vma.zst'
INFO: Finished Backup of VM 903 (00:10:02)
INFO: Backup finished at 2021-02-09 08:37:23



As I said, all the other users of the Ceph cluster haven't had any problems during that time.
I found a similar issue here: https://forum.proxmox.com/threads/vm-hangs-during-backup.62781/ - there, the link to the storage server was too slow. Since all our other backups (from local storage) complete without any problems, I don't think that was the issue here.


Here is the pveversion output:

Code:
pveversion -v

proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

And an RBD bench from the corresponding hypervisor:

Code:
/e/ceph# rbd bench -n XXX --io-type rw --io-pattern rand --io-size=4K --io-total=1G XXX
bench  type readwrite read:write=50:50 io_size 4096 io_threads 16 bytes 1073741824 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     11008  11024.01  45154344.14
   ....
elapsed:    72  ops:   262144  ops/sec:  3591.61  bytes/sec: 14711223.46
read_ops: 131207   read_ops/sec:  1797.65   read_bytes/sec: 7363187.78
write_ops: 130937   write_ops/sec:  1793.95   write_bytes/sec: 7348035.69
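Doing the math on the steady-state result: 3591.61 ops/s × 4096 bytes ≈ 14.7 MB/s (~14 MiB/s), which is right in the range of the ~15-25 MiB/s read speeds the backup of VM 801 achieved above - so the bottleneck seems to be small-block RBD throughput rather than the network links.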

What can I do to prevent the Proxmox backup from crashing my VMs?

BTW: why isn't Proxmox simply creating a storage snapshot and transferring the contents without locking the VM? Apparently it uses some QEMU feature (that "might slow down the vm" -> https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt).

Thanks in advance!
Cheers
Matze
 

Attachments

  • panic.png (763.3 KB)
how does the VM config look?
backup logs look fine, though it seems there is some bottleneck, since the speeds are not really high - maybe the guest writes get slowed down so much that they time out?
is the target storage for the backup keeping up?
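One way to check that last point (a sketch - the mount path below is a placeholder for the actual backup storage) is to watch the NFS mount from the PVE node while a backup runs:

Code:
# sample NFS throughput/latency for the backup mount every 5 seconds
# (nfsiostat ships with the nfs-common package)
nfsiostat 5 /mnt/pve/freenas-backup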

BTW: why isn't Proxmox simply creating a storage snapshot and transferring the contents without locking the VM? Apparently it uses some QEMU feature (that "might slow down the vm" -> https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt).
because not every storage supports snapshots (e.g. think raw files or block devices), but with this method we can have a snapshot-like backup on every underlying storage
 
I finally found out why the backup crashed the VMs.

TL;DR:
The kvm64 default CPU type that Proxmox provisions seems to lack CPU features that QEMU uses to speed up the backup. After changing the CPU type of all VMs to the "greatest common denominator" of all PVE nodes (in my case Broadwell), the backup speed increased dramatically.
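For anyone who wants to apply the same change, a minimal sketch (VMID 801 as an example - pick the CPU type that all of your nodes actually support):

Code:
# show the current CPU type of the VM (no 'cpu:' line means the kvm64 default)
qm config 801 | grep '^cpu:'

# set a CPU type that every cluster node supports (Broadwell in my case)
qm set 801 --cpu Broadwell

# the new CPU type only takes effect after a full stop/start of the VM
qm shutdown 801 && qm start 801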


Maybe this also helps somebody in the future :)


What I also tested (among others) - each run was a manual vzdump, see the sketch after this list:

- VM in Ceph config 1 (separate metadata (replicated) and data ("raid") pools) -> backup to local ZFS storage - more or less the same result as the original backup strategy
- VM in Ceph config 2 (one replicated pool for metadata and data) -> backup to local ZFS storage - slightly better results, but still devastatingly slow
- VM in Ceph config 1 -> backup to SSD-backed ZFS via NFS -> same result as the original config
- VM in Ceph config 2 -> backup to SSD-backed ZFS via NFS -> more or less the same result as the original config
- Network measurements in the Ceph network - nominal
- Network measurements in the NFS network - nominal
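For reference, the test runs were along these lines (the storage IDs 'local-zfs' and 'freenas-backup' are placeholders for our actual storage names):

Code:
# manual one-off snapshot-mode backup of the test VM to a given target storage
vzdump 801 --storage local-zfs --mode snapshot --compress zstd
vzdump 801 --storage freenas-backup --mode snapshot --compress zstd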


---
because not every storage supports snapshots (e.g. think raw files or block devices), but with this method we can have a snapshot-like backup on every underlying storage
I do see the point. However, from my point of view it would be a good idea to adapt the backup strategy to the underlying storage. I know that Proxmox tries to be as compatible as possible, but this leaves so much performance on the table. As a pro user, it would be great to at least have the option to specify how Proxmox should create backups (e.g. via RBD/ZFS snapshots), while e.g. beginners can stick with the safe, QEMU-standardized option.


Thanks for the help anyway.

Best,
Matthias
 
I am having quite a few similar crashes which seem to happen during backups. Can you clarify what you meant when you said you changed the CPU type to the "greatest common denominator"?

Mat
 
I am having quite a few similar crashes which seem to happen during backups. Can you clarify what you meant when you said you changed the CPU type to the "greatest common denominator"?

Mat
Hi Mat,

each VM in Proxmox is deployed with a specific virtual CPU type. By default, PVE uses kvm64 as the CPU type. This CPU type ensures that the VM runs on almost all hypervisor hosts; however, it is not optimal from a performance point of view. On the other hand, since kvm64 runs on all hypervisor hosts, it ensures that a VM can be live-migrated from any PVE node to any other PVE node in your cluster.

What I mean by "greatest common denominator" is the CPU type that ALL your cluster nodes support. If you have a homogeneous cluster (meaning all your cluster nodes are identical), you can choose the CPU type "host". This is rarely the case, though. Currently my VMs all share the CPU type "Broadwell". This ensures that I can migrate all VMs to all hosts while maintaining the best possible performance.
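A quick way to figure out a common baseline (a sketch - the node names pve1/pve2/pve3 are placeholders):

Code:
# on each cluster node, check the physical CPU model
for node in pve1 pve2 pve3; do
    ssh "$node" "grep -m1 'model name' /proc/cpuinfo"
done

# list the virtual CPU models this QEMU build knows about
kvm -cpu help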

Hope that helps!

Best
Matthias
 
I can fully confirm this. I had several guests crashing about every day during backup. After I changed to Sandy Bridge, they ran stable.
 
