VM disks corrupt after reverting to snapshot

Aug 18, 2021
Hi!

Over the past few days, we've had three separate VMs where the disk of the VM was corrupt after reverting to a snapshot. On one of these, we had successfully reverted to this snapshot before. All three VMs are on Ceph storage. The settings are mixed: two Windows VMs with a virtio driver, one Linux VM with the scsi driver and discard=on. Writeback cache is used on all disks.

The Linux disk seems to be just nulls, the Windows disks still have data in them. None of them have a valid partition table now.

On the Ceph side, things don't look the same either. The Linux disk has no rbd data blocks, while the Windows disks do, as confirmed with rbd info and rados ls.
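
Roughly, the check looks like this (pool and image names here are placeholders):
Code:
$ rbd info <pool>/vm-XXX-disk-Y                    # image metadata, including the block_name_prefix of its data objects
$ rados -p <pool> ls | grep <block_name_prefix>    # list the data objects; nothing comes back for the Linux disk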

All of this is in line with the backups we make with proxmox-backup-server: backups made before the revert look okay and a file restore from them works in the GUI, but a file restore from backups made after the revert is not possible.

This is Proxmox 6 with Ceph Octopus (output of pveversion -v is attached).

I'd be grateful for any hints on where to start figuring out how this could have happened!

Thanks a lot, Roel
 

Attachments

Hi!

I'm sorry to say that a few days back this problem happened again. The actions that preceded the problem were similar to last time, i.e. making a snapshot of a VM and rolling back multiple times. At first the rollback resulted in a working system, but a later rollback resulted in a non-working system. This happened with 3 VMs that were handled the same way.

Between the last occurrence of this problem and the recent one, Proxmox was upgraded to version 7, but it still has Ceph Octopus.

The systems were restored from an earlier backup and are now up and running again, but unfortunately (for me) this means I no longer have access to them in their broken state, so I cannot investigate them any further.

If problems regarding snapshots ring a bell for anyone, I'd love to hear from you.

Best regards, Roel
 
could you maybe provide the following information?

- pveversion -v
- VM config
- storage.cfg
- snapshot rollback task log, and if possible, journal from all nodes covering the rollback period (I assume Ceph is co-located/hyperconverged?)
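
roughly, these commands would cover it (VM ID and time range are placeholders to be filled in):
Code:
$ pveversion -v
$ qm config <VMID>
$ cat /etc/pve/storage.cfg
$ journalctl --since "<rollback start>" --until "<rollback end>"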
 
Hi Fabian, thanks for your time!

Here's in short what happened with this VM:
Code:
Jan 11 14:01:37 snapshot created
Jan 11 16:11:02 first rollback -> state okay
Jan 11 22:00:04 vzdump backup -> backup okay
Jan 12 11:39:00 2nd rollback -> disk image corrupt, doesn't boot
Jan 12 11:52:42 3rd rollback -> disk image corrupt, doesn't boot
Jan 12 22:00:04 vzdump backup -> backup unreadable (with file restore)

These are the times of actions done with VM 10101, but the same problem happened with 10102 and 10103.
Attached you can find the files you requested. I added the journal of the PVE node where the affected VMs are running. If you need other info, or info of other nodes, maybe you can specify what you need? This is a 24-node cluster, with 9 PVE nodes and 13 Ceph nodes, so providing everything may be a bit much.

Thanks again!

Roel
 

Attachments

the logs look okay AFAICT. do you still have the broken disks, or did you overwrite them when restoring?
 
Hi Fabian, unfortunately the original disks are gone, but I do still have a PBS backup of the disk that was made after the issue occurred and before the restore was done. Is that of any use? I reckon we should have kept the original images so we could have traced their rbd/Ceph properties.
Best regards, Roel
 
unfortunately the backup doesn't really help - if you manage to trigger it again, it would be interesting to take a look at the ceph side of things, and maybe attempt to map and access the snapshot.. snapshots are obviously meant to be immutable, so the symptoms look rather strange to me.
 
but it still has Ceph Octopus.

Did you check whether there is/was a known (rare?) problem/bug in this version regarding your case?

What I want to say: Octopus/15 is EOL [1], and maybe your (seemingly rarely occurring) problem is already fixed, or will be gone (because of changes/improvements made, for example), in a newer version?

Whether this could theoretically be the case at all (I mean, snapshots and their rollback are done, in this case, on the storage layer, so here by Ceph, right?!) would have to be confirmed by someone with more knowledge; but since that version is EOL, it would be recommended to upgrade [2] [3] anyway.

At least you could then rule out the (old) Ceph version...

[1] https://docs.ceph.com/en/latest/releases
[2] https://pve.proxmox.com/wiki/Ceph_Octopus_to_Pacific
[3] https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy
 
Did you check whether there is/was a known (rare?) problem/bug in this version regarding your case?

I did search through the ceph bug tracker but couldn't find anything that looked like our problem. An upgrade to Pacific has already been planned, but at the same time I'm reluctant to do upgrades in case this problem is a symptom of a problem with our cluster somehow.
 
Un(?)fortunately we hit the same problem again today. This is the same set of VMs as I reported about last time. Again, the actions that were done on the VMs were the same, i.e.:

(Yesterday)
- VM stopped
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)
- Nightly backup with vzdump to Proxmox Backup Server
(Today)
- VM stopped
- Snapshot restored
- VM does not start

I've left the situation as is, so we can investigate.
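
For completeness, the sequence above corresponds roughly to these commands (the snapshot name is just an example):
Code:
$ qm stop 10101
$ qm snapshot 10101 testsnap
$ qm start 10101
$ qm stop 10101
$ qm rollback 10101 testsnap
$ qm start 10101    # worked yesterday; after today's rollback the VM no longer starts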
 
okay - could you try the following, with the VM stopped:

map the current disk (replace pool, XXX and Y accordingly) and get its checksum (note that the Z needs to be replaced with whatever the first command returned!):
Code:
$ rbd map -p [hdd/ssd] vm-XXX-disk-Y
/dev/rbdZ
$ sha256sum /dev/rbdZ
[checksum of whole volume]
$ dd if=/dev/rbdZ bs=512 count=2048 | sha256sum
[checksum of first MB]
$ rbd unmap /dev/rbdZ

then repeat, but with the snapshot that you rolled back to:
Code:
$ rbd map -p [hdd/ssd] vm-XXX-disk-Y@SNAPSHOT
/dev/rbdZ
$ sha256sum /dev/rbdZ
[checksum of whole volume]
$ dd if=/dev/rbdZ bs=512 count=2048 | sha256sum
[checksum of first MB]
$ rbd unmap /dev/rbdZ

the two checksums should be identical for both the snapshot and the volume, given your sequence of actions above.
 
the two checksums should be identical for both the snapshot and the volume, given your sequence of actions above.

The checksums of the first MB are identical, the checksums of the entire image are not. But that's probably due to the fact that I've tried to start the VM in the meantime. If I revert the snapshot, and don't start the VM, then the checksums of the whole image are identical.

What seems to be the case is that somehow the snapshot gets corrupted, and then when the image is reverted to that snapshot, the image itself is also broken.
 
could you give timestamps for the following actions:
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)

?

if you are always creating the snapshots with the VM powered off, could you gather checksums of all the disks directly after creating the snapshot, and after each rollback (successful or not) and post the results here when you next trigger the issue?
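
something along these lines should work (pool, image and snapshot names are just examples):
Code:
$ for img in ssd/vm-10101-disk-0 ssd/vm-10101-disk-1 hdd/vm-10101-disk-2; do
      dev=$(rbd map "$img")                # map the image and remember the returned /dev/rbdX
      echo -n "$img  "; sha256sum "$dev"   # checksum the whole volume
      rbd unmap "$dev"
  done
$ # repeat with the snapshots, e.g. ssd/vm-10101-disk-0@SNAPSHOT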
 
I've attached a screenshot of the related tasks. Backup was made Jan 24 22:00:04 - Jan 24 22:02:18.

I can't trigger the issue by running all the actions straight after one another. I'll have to work with my colleagues to ensure we can make checksums after each action when they do their thing. I'll post back when I have more info.

Thanks a lot so far, Roel
 

Attachments

  • tasks-10101.png
Hi! Back again soon, I'm afraid.

The problem still exists, but this time I have more information.
We've upgraded to Pacific in the meantime, so the problem was not related to Octopus.

In the most recent occurrence I made checksums of the images and of the snapshots in between the actions that were done, as per @fabian's advice. It turns out that the checksums of the snapshot images are different today than they were yesterday!

Some more details: our test VM has three disks attached (see attached screenshot). The first two are on our ssd pool, the third lives on the hdd pool. The image on the hdd pool has backups disabled. When comparing the checksums of the snapshots, only the snapshots of the images on the ssd pool were different today; the snapshot of the image on the hdd pool was still identical.
We use krbd for both pools.

Again, backups were made with vzdump during the night.

We're now going to extend the test setup so it has four disks, two on ssd, two on hdd, and backup only the first one of each set, to see if this is related to our ssd pool, or if it is related to the fact a backup is made of the disk.
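
The disk entries in the VM config would then look roughly like this (disk numbers and sizes are made up):
Code:
scsi0: ssd:vm-10101-disk-0,cache=writeback,discard=on,size=32G
scsi1: ssd:vm-10101-disk-1,backup=0,cache=writeback,discard=on,size=32G
scsi2: hdd:vm-10101-disk-2,cache=writeback,discard=on,size=100G
scsi3: hdd:vm-10101-disk-3,backup=0,cache=writeback,discard=on,size=100G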

As usual, I'd appreciate any advice or insights. In the meantime, we'll continue to test and I'll post back the results.

Best regards, Roel
 

Attachments

  • vm-hardware.png
that does sound quite surprising (but of course would explain the symptoms!).. anything else that is different/custom about your setup on the Ceph side? (rbd-mirror, cache settings, ..)? anything in the ceph logs between the good and bad state? does a (deep-)scrub show any errors?
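
for the scrub part, something like this (pool name and PG id need to be filled in):
Code:
$ ceph health detail                   # any scrub errors / inconsistent PGs?
$ ceph pg deep-scrub <pgid>            # force a deep scrub of a specific PG
$ rados list-inconsistent-pg <pool>    # PGs with recorded inconsistencies
$ rados list-inconsistent-obj <pgid>   # inconsistent objects within such a PG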
 
The Linux disk seems to be just nulls, the Windows disks still have data in them. None of them have a valid partition table now.
(Yesterday)
- VM stopped
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)
- Nightly backup with vzdump to Proxmox Backup Server

(Today)
- VM stopped
- Snapshot restored
- VM does not start
Some more details: our test VM has three disks attached (see attached screenshot). The first two are on our ssd pool, the third lives on the hdd pool. The image on the hdd pool has backups disabled. When comparing the checksums of the snapshots, only the snapshots of the images on the ssd pool were different today, the snapshot of the image on the hdd pool was still identical.

I do not want to make more noise about things I have no clue of, but could it somehow be related to this?:
https://bugzilla.proxmox.com/show_bug.cgi?id=2874
 
it might be part of it, but since snapshots are supposed to be immutable after creation, there is at least one more issue (likely in Ceph), else the snapshot itself couldn't be corrupt, only the current (writable) state..

edit: double-checked the VM config - all scsi, so that seems even more unlikely, given that the linked bug only seems to affect sata/ide attached volumes..
 
