VM filesystem corruption after suspending & reboot

broth-itk

Member
Dec 30, 2024
Dear Community,

During an update of PVE from 8.4.1 to 8.4.5, I suspended all virtual machines and rebooted the server.
Once everything came back online, I was shocked to see that all of my Linux VMs were in one of these states:

  • frozen
  • showing systemd errors like "read-only filesystem"
  • unable to boot (dropped to initramfs, "unexpected inconsistency")

This was rather unexpected and I wonder what went wrong.

The steps I took to update the system were (a rough CLI equivalent follows the list):

  1. apt update && apt dist-upgrade
  2. Since many packages were upgraded, I decided that a reboot was necessary
  3. Suspend all running VMs
  4. Reboot the system
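
For reference, this is roughly what those steps look like on the CLI (VM IDs are just examples, not my real ones):

  apt update && apt dist-upgrade
  qm suspend 101 --todisk 1   # "Hibernate" in the GUI: saves the VM's RAM state to a vmstate volume
  qm suspend 102 --todisk 1
  reboot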

Now, I'm aware that installing packages while VMs were still running might not have been such a good idea.
Nevertheless, if this is a known problem, what can be done to prevent it from happening?

I am thinking about an apt hook which reports running VMs when critical patches are applied.
Requiring this admin consent would act as a reminder and could help avoid crashing the machines.
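
A minimal sketch of what I have in mind, assuming a DPkg::Pre-Invoke hook and the standard qm CLI (file names and the hard abort are just illustrative):

  # /etc/apt/apt.conf.d/80check-running-vms  (illustrative file name)
  DPkg::Pre-Invoke { "/usr/local/bin/check-running-vms"; };

  #!/bin/sh
  # /usr/local/bin/check-running-vms  (illustrative path, must be executable)
  # Warn (or abort) when packages are about to be installed while VMs are still running.
  running=$(qm list 2>/dev/null | awk '$3 == "running"' | wc -l)
  if [ "$running" -gt 0 ]; then
      echo "WARNING: $running VM(s) are still running while packages are being installed." >&2
      # exit 1   # uncomment to make apt abort instead of only warning
  fi
  exit 0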


I'm happy to share my version information, but since the package versions from before the reboot are no longer available, I wonder if that makes sense.

The server is an HPE DL380 Gen10 with dual 2nd Gen Xeon CPUs, 512 GB RAM, and several zpools with NVMe SSDs for the VMs.

Best regards,
Bernhard
 
Suspend all running VMs
Do you mean Hibernate or Shutdown?

If you mean the former - this is probably the source of your issue, as Hibernate saves the VM's current (pre-update) state to disk, which probably does not match the post-update environment. Have you tried stopping the VM(s) and starting them again now?

If you mean the latter - you'll need to dig deeper (logs etc.).
 
I mean "Hibernate", sorry (Suspend is VMware terminology).

Hibernation was started post-update, then I rebooted.

The VMs, in their resumed state, were broken.
Rebooting them took me to the initramfs prompt.

After repairing the filesystems, the systems are working again, but this should not happen.
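
(For anyone landing at the same initramfs prompt: the repair was essentially a manual filesystem check, something along these lines - device names depend on the guest layout.)

  (initramfs) fsck -y /dev/sda1   # repair the root filesystem, answering "yes" to fixes
  (initramfs) exit                # continue booting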
 
I have already outlined what went wrong.


Hibernation was started post-update
Although technically true, many/most updates are not applied to a running VM, so your hibernation captured the running VM's pre-update state.

but this should not happen
For correct disk/state integrity you should have shut down the VMs, not hibernated them.

Most OSs that apply major updates do so from a shut-down state, not a hibernated one - for this reason.
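
In qm terms the difference is roughly this (VM ID illustrative):

  qm shutdown 101            # clean guest shutdown - filesystems get unmounted and synced
  # versus
  qm suspend 101 --todisk 1  # hibernate - RAM state is frozen to disk, dirty caches and all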
 
For correct disk/state integrity you should have shut down the VMs, not hibernated them.

Disk and state information must be valid at all times.

If a valid state cannot be guaranteed (e.g. with pending updates), the system should block the action.

I understand the "I should not have" argument, but IMHO there are some issues:

  1. As an admin you can't remember every quirk all the time.
  2. Technically, there are ways to prevent potentially harmful, unintended actions.
  3. In VMware environments we always suspend and resume/start VMs throughout an update/patch cycle. Such is the force of habit.
  4. There are virtual machines which simply can't shut down properly without manual intervention (like proprietary appliances with no qemu-guest-agent support); a possible fallback is sketched after this list.
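
(For point 4, the best I can do today is an ACPI shutdown with a timeout and a forced stop as a fallback, roughly - VM ID illustrative:)

  qm shutdown 105 --timeout 120 || qm stop 105   # try a clean/ACPI shutdown first, force the VM off if it never completes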
 
Disk and state information must be valid at all times.
I agree with this "wish" - but it definitely isn't always the case. Compare it (in the extreme case) to migrating a running disk + RAM onto new hardware: it will probably not function correctly.

If a valid state cannot be guaranteed (e.g. with pending updates), the system should block the action.
While this would have helped you - it is going to be somewhat hit-and-miss for the hypervisor to detect this situation, as a lot of VMs (the majority?) would not have suffered from it. I suspect your VMs were doing DB activity; that never plays well with hibernate and resume - even on bare metal.

As an admin you can't remember every quirk all the time.
But you are required/responsible to do so.

In VMware environments we always suspend and resume/start VMs throughout an update/patch cycle. Such is the force of habit.
But you decided not to do that with PVE (you hibernated post-update, pre-reboot, which can be a "mixed bag" as above) - even though doing it that way may not have avoided your issue.

There are virtual machines which simply can't shut down properly without manual intervention
The same way you hibernated the VM, you could have shut it down. (I'm not suggesting this is without issues for a VM without the guest agent.)


If you insist on comparing your experience to VMware, then do a web search for "vmware vm not starting after suspend after server reboot".
 
I think the conversation is going down the drain; let's agree to disagree.

The main purpose of my post is to discuss a potential issue and think about root causes and possible solutions (where applicable).

On a higher level, the goal should be to improve PVE overall and make life easier for everyone using the solution, whatever the technical background may be.
 
Hibernating the VMs isn't best practice; the recommendation is to either migrate the VMs to another host of the cluster or fully shut them down and start them again (besides having backups, etc., which isn't relevant in this case).
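
In CLI terms that is roughly (node name and VM ID are illustrative):

  # either live-migrate the VM to another node of the cluster ...
  qm migrate 101 othernode --online
  # ... or shut it down cleanly and start it again after the host reboot
  qm shutdown 101
  qm start 101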

I can confirm that you can install updated packages with running VMs without any issue: the running VMs keep using the already-loaded binaries/libraries. A different matter are packages like openvswitch-switch or frr, which might restart some service during the update, potentially affecting network connectivity to the storage.

Still, this sounds weird anyway, because once the VM is hibernated the whole memory is on the storage and the VM disks do not receive any further changes, as there is no guest OS running. If you followed the normal update paths, no QEMU major version change happens from PVE 8.4.1 to 8.4.5. If there were one, maybe the format in which QEMU saves/restores the memory during hibernation could have changed, explaining your behavior, but that doesn't seem to be the case.

One of my lab machines had PVE 8.4.1. It uses two NVMe drives in a ZFS mirror. I followed your steps:

- A few VMs running (a nested PVE, two Debian 12, one Ubuntu 24.04).
- apt update && apt dist-upgrade.
- Hibernate the VMs.
- Reboot the host.
- Resume the VMs.

They all started correctly. I even did a full fsck on all four; no errors reported.
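
(The fsck was a plain offline check, something like this from a rescue environment - device names are illustrative.)

  e2fsck -f -n /dev/vda1   # force a full check, read-only: "-n" answers no to any repair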

Seems like there is some kind of interaction with something specific to your configuration, either at the host level, the VM(s), and/or the storage, that somehow caused some data not to reach the storage, producing the behavior you saw.

 
Thanks for your help trying to reproduce the issue!

I can confirm that you can install updated packages with running VMs without any issue: the running VMs keep using the already-loaded binaries/libraries. A different matter are packages like openvswitch-switch or

Thanks for the confirmation! I have the same understanding that running updates on live systems will not affect running VMs until they are restarted.
openvswitch is not used in my case.

Seems like there is some kind of interaction with something specific to your configuration, either at the host level, the VM(s), and/or the storage, that somehow caused some data not to reach the storage, producing the behavior you saw.

Even though hibernation is not best practice, I consider my actions not to have been harmful to the data.

I'm starting to feel that my assumption that the corruption is related to hibernate & resume is wrong.
I'm going to look into the logs to check whether other actions happened in the meantime.
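
For reference, the kind of places I'll be looking at (these are the standard tools; pool names and details are specific to my setup):

  journalctl -b -1 --no-pager   # host journal from the boot in which the VMs were hibernated
  ls /var/log/pve/tasks/        # PVE task logs (the hibernate/suspend tasks show up here)
  zpool status -v               # health of the NVMe zpools holding the VM disks
  zpool events -v               # recent ZFS events (I/O errors etc.)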

Thanks again for your guidance!