What processes/resources are used while doing a VM backup in "stop" mode

marcosscriven

I'm having an issue with my NVMe controller going offline: https://forum.proxmox.com/threads/p...ffff-pci_status-0x10.88604/page-2#post-471159

As you'll see in that thread, there are lots of possible causes, but I've now turned off all PCIe power state management:

Code:
cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet video=efifb:off acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
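To verify those settings actually took effect, the link state and power-state features can be checked from the host (assuming nvme-cli is installed; substitute your controller's PCI address for 01:00.0):

Code:
# LnkCtl should report "ASPM Disabled" on the NVMe's PCI device
lspci -vv -s 01:00.0 | grep -i aspm
# feature 0x0c is Autonomous Power State Transition; with
# nvme_core.default_ps_max_latency_us=0 it should come back disabled
nvme get-feature /dev/nvme0 -f 0x0c -H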

Initially I only ever saw this issue while playing a game in a VFIO passthrough VM, but I've actually been able to replicate it just by backing up the Windows VM while it's stopped:

Code:
INFO: starting new backup job: vzdump 101 --storage pve --mode stop --compress zstd --node pve --remove 0
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2022-05-19 10:17:20
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: desktop
INFO: include disk 'scsi0' 'vms:vm-101-disk-1' 128G
INFO: include disk 'scsi1' 'vms:vm-101-disk-2' 256G
INFO: include disk 'efidisk0' 'vms:vm-101-disk-0' 4M
INFO: creating vzdump archive '/mnt/data/pve/dump/vzdump-qemu-101-2022_05_19-10_17_20.vma.zst'
INFO: starting kvm to execute backup task
INFO: started backup task 'd5cb495e-6ef2-4a2e-b804-acb19891d2fb'
INFO:   0% (786.6 MiB of 384.0 GiB) in 3s, read: 262.2 MiB/s, write: 246.4 MiB/s
INFO:   1% (3.9 GiB of 384.0 GiB) in 16s, read: 246.8 MiB/s, write: 226.9 MiB/s
...
INFO:  87% (335.1 GiB of 384.0 GiB) in 16m 37s, read: 22.9 MiB/s, write: 18.1 MiB/s
ERROR: job failed with err -125 - Operation canceled
INFO: aborting backup job
INFO: stopping kvm after backup task
trying to acquire lock...
 OK

In the logs, I see this is down to the SSD controller going offline:

Code:
[ 1216.472650] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 1216.512634] blk_update_request: I/O error, dev nvme0n1, sector 32597504 op 0x0:(READ) flags 0x80700 phys_seg 48 prio class 0
[ 1216.512662] blk_update_request: I/O error, dev nvme0n1, sector 32583552 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 0
[ 1216.512687] blk_update_request: I/O error, dev nvme0n1, sector 32591232 op 0x0:(READ) flags 0x80700 phys_seg 21 prio class 0
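For what it's worth, the drive's own logs can be pulled once the controller comes back after the reset, in case it records anything itself (again assuming nvme-cli):

Code:
# per-drive error and health logs, read straight from the controller
nvme error-log /dev/nvme0
nvme smart-log /dev/nvme0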

One thing surprised me while running the backup on the stopped VM with the "stop" backup mode - the KVM process starts up for that machine, and sometimes uses up to 800% CPU (i.e. an average of 8 of my 24 cores).
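If part of that is the zstd compressor, its thread count can apparently be tuned with vzdump's --zstd option (a sketch, assuming the option exists in this PVE version):

Code:
# cap compression at 4 threads for this job
vzdump 101 --storage pve --mode stop --compress zstd --zstd 4 --node pve --remove 0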

Now, I don't suppose the backup process itself is broken, but I'd like to know what it's doing that might be exposing this problem.

Another important thing: the disk that's going offline is the one the VM disks are on. The backups are written to a different NVMe, which is working fine.
 
a VM backup always happens "within" the Qemu process - so if the VM is not running, it needs to be started for the backup. it won't really be started, though: the guest never begins executing; the VM is brought up in a paused state that allows the backup to run without actually booting the guest. this does mean all the resources required for the VM to run need to be available (RAM, disks, passed-through host hardware, ..).
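you can observe this yourself while a stop-mode backup runs, e.g. with the VMID from this thread:

Code:
# the KVM process for the VM exists during the backup even though
# the guest never executes
qm status 101
ps -ef | grep 'kvm.*-id 101'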
 
Thanks @fabian

Do you have any inkling, then, what might contribute to an NVMe or PCI problem, either while running the VM or just while backing it up?

Does the backup process use the virtio-scsi driver? Could there be low-level qemu/kvm issues there, or any verbose logging or settings that could help me narrow this down? I ask because no amount of raw dd reads from the drive outside the KVM process causes this issue.
 
no, qemu doesn't "use virtio scsi" to access the NVMe - that is just how the disk gets exposed to the guest. depending on what storage "vms" is, there might be one of several other layers involved:
- ZFS
- LVM
- some filesystem
- ..

but fundamentally, a backup of a stopped VM does nothing more than just reading the full data stored for the guest image..
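that read can be approximated from the host, e.g. (assuming the volume is activated and readable there):

Code:
# resolve where 'vms:vm-101-disk-1' actually lives, then read it
# sequentially end to end, roughly like a stop-mode backup would
pvesm path vms:vm-101-disk-1
dd if=$(pvesm path vms:vm-101-disk-1) of=/dev/null bs=1M status=progress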
 
Thanks again.

This is probably venturing into Qemu territory, and I should likely find another forum to ask in.

However, if the VM is "started... in a paused state", and the backup is "just reading the full data stored for the guest image", what is the Qemu process doing differently that couldn't be achieved just by reading the disk images outside a qemu process?

To reiterate, I'm struggling to understand why the SSD fails *only* from within the running VM or while backing it up, yet no operation performed directly from the host ever causes such an issue.
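One thing I might try to isolate that: qemu-img can drive the image through QEMU's own block layer without any guest at all (a sketch - the device path is just an example; use whatever pvesm path reports for the volume):

Code:
# reads via QEMU's block layer with cache=none (O_DIRECT),
# queue depth 16, 64K requests; path and numbers are examples
qemu-img bench -f raw -t none -d 16 -s 64K -c 100000 /dev/zvol/vms/vm-101-disk-1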
 
block sizes, access patterns (sequential vs. random vs. ..), access modes (O_DIRECT vs. regular IO) - there can be lots of differences there ;)
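fio can reproduce most of those variables from the host, if you want to bisect which one matters (device path and sizes below are placeholders):

Code:
# random 64K reads with O_DIRECT at queue depth 16 - vary bs/rw/direct
# to approximate different access patterns
fio --name=probe --filename=/dev/nvme0n1 --readonly \
    --rw=randread --bs=64k --direct=1 --ioengine=libaio --iodepth=16 \
    --runtime=60 --time_based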
 
