VM disks corrupt after reverting to snapshot

Aug 18, 2021
Hi!

Over the past few days, we've had three separate VMs where the disk of the VM was corrupt after reverting to a snapshot. On one of these, we had successfully reverted to this snapshot before. All three VMs are on Ceph storage. The settings are mixed: two Windows VMs with a virtio driver, one Linux VM with the scsi driver and discard=on. Writeback cache is used on all disks.

The Linux disk seems to be just nulls, the Windows disks still have data in them. None of them have a valid partition table now.

On the Ceph side, things don't look the same either. The Linux disk has no rbd data blocks, while the Windows disks do, as confirmed with rbd info and rados ls.
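
Roughly, the check looks like this (pool and image names here are placeholders):
Code:
$ rbd info <pool>/vm-XXX-disk-Y                    # image metadata, including the block_name_prefix of its data objects
$ rados -p <pool> ls | grep <block_name_prefix>    # list the data objects; nothing comes back for the Linux disk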

All of this is in line with the backups we make with proxmox-backup-server: backups made before the revert look okay and a file restore from them works in the GUI, but a file restore from backups made after the revert is not possible.

This is Proxmox 6 with Ceph Octopus (output of pveversion -v is attached).

I'd be grateful for any hints on where to start figuring out how this could have happened!

Thanks a lot, Roel
 

Attachments

Hi!

I'm sorry to say that a few days back this problem happened again. The actions that preceded the problem were similar to last time, i.e. making a snapshot of a VM and rolling back multiple times. At first the rollback resulted in a working system, but a later rollback resulted in a non-working system. This happened with 3 VMs that were handled the same way.

Between the last occurrence of this problem and the recent one, Proxmox was upgraded to version 7, but it still has Ceph Octopus.

The systems were restored from an earlier backup and are now up and running again, but unfortunately (for me) this means I no longer have access to them in their broken state, so I cannot investigate them any further.

If problems regarding snapshots ring a bell for anyone, I'd love to hear from you.

Best regards, Roel
 
could you maybe provide the following information?

- pveversion -v
- VM config
- storage.cfg
- snapshot rollback task log, and if possible, journal from all nodes covering the rollback period (I assume Ceph is co-located/hyperconverged?)
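
roughly, these commands would cover it (VM ID and time range are placeholders to be filled in):
Code:
$ pveversion -v
$ qm config <VMID>
$ cat /etc/pve/storage.cfg
$ journalctl --since "<rollback start>" --until "<rollback end>"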
 
Hi Fabian, thanks for your time!

Here's in short what happened with this VM:
Code:
Jan 11 14:01:37 snapshot created
Jan 11 16:11:02 first rollback -> state okay
Jan 11 22:00:04 vzdump backup -> backup okay
Jan 12 11:39:00 2nd rollback -> disk image corrupt, doesn't boot
Jan 12 11:52:42 3rd rollback -> disk image corrupt, doesn't boot
Jan 12 22:00:04 vzdump backup -> backup unreadable (with file restore)

These are the times of actions done with VM 10101, but the same problem happened with 10102 and 10103.
Attached you can find the files you requested. I added the journal of the PVE node where the affected VMs are running. If you need other info, or info of other nodes, maybe you can specify what you need? This is a 24-node cluster, with 9 PVE nodes and 13 Ceph nodes, so providing everything may be a bit much.

Thanks again!

Roel
 

Attachments

the logs look okay AFAICT. do you still have the broken disks, or did you overwrite them when restoring?
 
Hi Fabian, unfortunately the original disks are gone, but I do still have a PBS backup of the disk that was made after the issue occurred and before the restore was done. Is that of any use? I reckon we should have kept the original images so we could have traced their rbd/Ceph properties.
Best regards, Roel
 
unfortunately the backup doesn't really help - if you manage to trigger it again, it would be interesting to take a look at the ceph side of things, and maybe attempt to map and access the snapshot.. snapshots are obviously meant to be immutable, so the symptoms look rather strange to me.
 
but it still has Ceph Octopus.

Did you check whether there is/was a known (rare?) problem/bug in this version regarding your case?

What I want to say: Octopus/15 is EOL [1], and maybe your (seemingly rarely occurring) problem is already fixed, or will be gone (because of changes/improvements made, for example), in a newer version?

Whether this could theoretically be the case at all (I mean, snapshots and their rollback are done, in this case, on the storage layer, so here by Ceph, right?!) would have to be confirmed by someone with more knowledge; but since that version is EOL, it would be recommended to upgrade [2] [3] anyway.

At least you could then rule out the (old) Ceph version...

[1] https://docs.ceph.com/en/latest/releases
[2] https://pve.proxmox.com/wiki/Ceph_Octopus_to_Pacific
[3] https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy
 
Did you check whether there is/was a known (rare?) problem/bug in this version regarding your case?

I did search through the ceph bug tracker but couldn't find anything that looked like our problem. An upgrade to Pacific has already been planned, but at the same time I'm reluctant to do upgrades in case this problem is a symptom of a problem with our cluster somehow.
 
Un(?)fortunately we hit the same problem again today. This is the same set of VMs as I reported about last time. Again, the actions that were done on the VMs were the same, i.e.:

(Yesterday)
- VM stopped
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)
- Nightly backup with vzdump to Proxmox Backup Server
(Today)
- VM stopped
- Snapshot restored
- VM does not start

I've left the situation as is, so we can investigate.
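
For completeness, the sequence above corresponds roughly to these commands (the snapshot name is just an example):
Code:
$ qm stop 10101
$ qm snapshot 10101 testsnap
$ qm start 10101
$ qm stop 10101
$ qm rollback 10101 testsnap
$ qm start 10101    # worked yesterday; after today's rollback the VM no longer starts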
 
okay - could you try the following, with the VM stopped:

map the current disk (replace pool, XXX and Y accordingly) and get its checksum (note that the Z needs to be replaced with whatever the first command returned!):
Code:
$ rbd map -p [hdd/ssd] vm-XXX-disk-Y
/dev/rbdZ
$ sha256sum /dev/rbdZ
[checksum of whole volume]
$ dd if=/dev/rbdZ bs=512 count=2048 | sha256sum
[checksum of first MB]
$ rbd unmap /dev/rbdZ

then repeat, but with the snapshot that you rolled back to:
Code:
$ rbd map -p [hdd/ssd] vm-XXX-disk-Y@SNAPSHOT
/dev/rbdZ
$ sha256sum /dev/rbdZ
[checksum of whole volume]
$ dd if=/dev/rbdZ bs=512 count=2048 | sha256sum
[checksum of first MB]
$ rbd unmap /dev/rbdZ

the two checksums should be identical for both the snapshot and the volume, given your sequence of actions above.
 
the two checksums should be identical for both the snapshot and the volume, given your sequence of actions above.

The checksums of the first MB are identical, the checksums of the entire image are not. But that's probably due to the fact that I've tried to start the VM in the meantime. If I revert the snapshot, and don't start the VM, then the checksums of the whole image are identical.

What seems to be the case is that somehow the snapshot gets corrupted, and then when the image is reverted to that snapshot, the image itself is also broken.
 
could you give timestamps for the following actions:
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)

?

if you are always creating the snapshots with the VM powered off, could you gather checksums of all the disks directly after creating the snapshot, and after each rollback (successful or not) and post the results here when you next trigger the issue?
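
something along these lines should work (pool, image and snapshot names are just examples):
Code:
$ for img in ssd/vm-10101-disk-0 ssd/vm-10101-disk-1 hdd/vm-10101-disk-2; do
      dev=$(rbd map "$img")                # map the image and remember the returned /dev/rbdX
      echo -n "$img  "; sha256sum "$dev"   # checksum the whole volume
      rbd unmap "$dev"
  done
$ # repeat with the snapshots, e.g. ssd/vm-10101-disk-0@SNAPSHOT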
 
I've attached a screenshot of the related tasks. Backup was made Jan 24 22:00:04 - Jan 24 22:02:18.

I can't trigger the issue by running all the actions straight after one another. I'll have to work with my colleagues to ensure we can make checksums after each action when they do their thing. I'll post back when I have more info.

Thanks a lot so far, Roel
 

Attachments

  • tasks-10101.png
Hi! Back again soon, I'm afraid.

The problem still exists, but this time I have more information.
We've upgraded to Pacific in the meantime, so the problem was not related to Octopus.

In the most recent occurrence I made checksums of the images and of the snapshots in between the actions that were done, as per @fabian's advice. It turns out that the checksums of the snapshot images are different today than they were yesterday!

Some more details: our test VM has three disks attached (see attached screenshot). The first two are on our ssd pool, the third lives on the hdd pool. The image on the hdd pool has backups disabled. When comparing the checksums of the snapshots, only the snapshots of the images on the ssd pool were different today; the snapshot of the image on the hdd pool was still identical.
We use krbd for both pools.

Again, backups were made with vzdump during the night.

We're now going to extend the test setup so it has four disks, two on ssd, two on hdd, and backup only the first one of each set, to see if this is related to our ssd pool, or if it is related to the fact a backup is made of the disk.
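
The disk entries in the VM config would then look roughly like this (disk numbers and sizes are made up):
Code:
scsi0: ssd:vm-10101-disk-0,cache=writeback,discard=on,size=32G
scsi1: ssd:vm-10101-disk-1,backup=0,cache=writeback,discard=on,size=32G
scsi2: hdd:vm-10101-disk-2,cache=writeback,discard=on,size=100G
scsi3: hdd:vm-10101-disk-3,backup=0,cache=writeback,discard=on,size=100G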

As usual, I'd appreciate any advice or insights. In the meantime, we'll continue to test and I'll post back the results.

Best regards, Roel
 

Attachments

  • vm-hardware.png
that does sound quite surprising (but of course would explain the symptoms!).. anything else that is different/custom about your setup on the Ceph side? (rbd-mirror, cache settings, ..)? anything in the ceph logs between the good and bad state? does a (deep-)scrub show any errors?
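
for the scrub part, something like this (pool name and PG id need to be filled in):
Code:
$ ceph health detail                   # any scrub errors / inconsistent PGs?
$ ceph pg deep-scrub <pgid>            # force a deep scrub of a specific PG
$ rados list-inconsistent-pg <pool>    # PGs with recorded inconsistencies
$ rados list-inconsistent-obj <pgid>   # inconsistent objects within such a PG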
 
The Linux disk seems to be just nulls, the Windows disks still have data in them. None of them have a valid partition table now.
(Yesterday)
- VM stopped
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)
- Nightly backup with vzdump to Proxmox Backup Server

(Today)
- VM stopped
- Snapshot restored
- VM does not start
Some more details: our test VM has three disks attached (see attached screenshot). The first two are on our ssd pool, the third lives on the hdd pool. The image on the hdd pool has backups disabled. When comparing the checksums of the snapshots, only the snapshots of the images on the ssd pool were different today, the snapshot of the image on the hdd pool was still identical.

I do not want to make more noise about things I have no clue of, but could it somehow be related to this?:
https://bugzilla.proxmox.com/show_bug.cgi?id=2874
 
it might be part of it, but since snapshots are supposed to be immutable after creation, there is at least one more issue (likely in Ceph), else the snapshot itself couldn't be corrupt, only the current (writable) state..

edit: double-checked the VM config - all scsi, so that seems even more unlikely, given that the linked bug only seems to affect sata/ide attached volumes..
 
