Urgent/important issue regarding proxmox/ceph storage and kvm virtualization!

Daniel S. · Feb 4, 2022

Hello,

We are running multiple VMs in the following environment: proxmox cluster with ceph storage - block storage - all osds are enterprise SSDs (RBD pool 3 times replicated).

ceph version: 15.2.11

All nodes inside the cluster have exactly this following version: https://pastebin.com/ugjzptQ9

We have installed an redhat based OS on one VM and started to migrate data to it from another machine (rsync)(the machine from where we started to restore is outside this cluster).

The VM had 3 virtio-scsi disks added, check below for full config disk info.
The VM had an EFI disk (all disks including EFI were located on the same ceph rbd storage) - using OVMF uefi as bios.

This is from the VM config file:

Code:

efidisk0: rbd:vm-108-disk-1,size=1M
scsi0: rbd:vm-108-disk-0,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=250G    - /dev/sda1 partition ext4
scsi1: rbd:vm-108-disk-2,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=500G    - /dev/sdb1 partition ext4
scsi2: rbd:vm-108-disk-3,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=2T         - /dev/sdc1 partition ext4

The VM was suddenly killed by oom-killer, no issue here, as we assigned too much memory to the VM. (node has 256GB of ram) and has few more VMs running on it, we added 192GB to this specific VM, so we prolly need more ram, ok, but see below what happened.

Check the logs from the node which hosted the VM when it killed it: https://pastebin.com/EUPZa9m7

The very big problem is that /dev/sda1 and /dev/sdb1 partitions do not exist anymore on the system after we boot it, it appears that something wiped/removed them up, this is unacceptable. We boot using live cds, disks are all there but there are no more partitions on /dev/sda and /dev/sdb drives - the only one which still exist and can be mounted is /dev/sdc1.

Do you have any idea about this? What could cause this behaviour?
Nothing happened on ceph, nothing suspicious on logs, no failed osds or pgs, health all the time was and is OK.

This is a very weird situation - we are running in these environments since years, using multiple different OSes on VMs, we have never encountered such issues, this has to be investigated.

If someone has a hint/clue/idea please let me know.
Thank you.

t.lamprecht · Feb 8, 2022

Daniel S. said:
The very big problem is that /dev/sda1 and /dev/sdb1 partitions do not exist anymore on the system after we boot it, it appears that something wiped/removed them up, this is unacceptable. We boot using live cds, disks are all there but there are no more partitions on /dev/sda and /dev/sdb drives - the only one which still exist and can be mounted is /dev/sdc1.

Do you have any idea about this? What could cause this behaviour?

The simplest explanation would be that the changes to partition and data simply were not committed to disk yet, i.e., in memory cache only and thus got lost when the VM was OOM killed. That theory could be supported by the fact that the VM is using the writeback cache mode, which allows some writes to return immediately, i.e., before being on the actual storage.

For ceph this would also require that either less than 24 MiB got written yet, or the rbd_cache_max_dirty config was tuned to a higher value, as else writes start to block until flushed to avoid too much outstanding writes in the cache.

In addition to that the mount options and kernel would also have something to say in that whole behavior.

As said, above I'm going for the simplest explanation in general, closer investigation would require a lot more info, possibly some experimenting in the setup and also a lot of time, that's rather out of scope for most to do in the community forum.

Daniel S. · Feb 8, 2022

Hello,

Thank you for your explanation.

Does what you mention applies if this specific VM was create like 1 month ago? Was shutdown, started/rebooted several times before starting the migration?

What I mean is that the changes were surely commited to the disks as those were created long time before.

t.lamprecht · Feb 8, 2022

Daniel S. said:
Does what you mention applies if this specific VM was create like 1 month ago? Was shutdown, started/rebooted several times before starting the migration?

No, then it's rather impossible that above theory applies in your case.

Daniel S. · Feb 8, 2022

That is what I also thought.

Now, the problem is that we are afraid to try again because we have no idea what else could happen, as there are more VMs running inside this cluster, if you have any other idea what should we check for please let us know.

RolandK · Feb 26, 2022

hi, can you check if it's just the partition table being wiped and if you are able to restore with testdisk/gpart ?

has using backup with vzdump or pbs been involved?

there is a longer standing bug on loosing partition table ( https://bugzilla.proxmox.com/show_bug.cgi?id=2874 ) and i think since there has been at least a handful of people affected, it should really get more priorization imho, because it's a bug which can lead to loosing trust into this great product. anyhow, there is no repro-case yet, so it's quite difficult to identify what's causing it.

Daniel S. · Feb 28, 2022

Can't be restored, disks exists but partition table was wiped.

No backup was running, only what I first specified in the first post .

It should get more priority because this is very alarming and can happen in a production environment.

d1_sen · Mar 22, 2022

I am experiencing the same issue with one of my VMs. Tried TestDisk, but no luck with it. Going to try out gpart.
@Daniel S. did you manage to recover your VM?

I see this bug https://bugzilla.proxmox.com/show_bug.cgi?id=2874, but not fix just yet.

Daniel S. · Mar 22, 2022

There is no way to recover anything. The bug seems similar, because when running backups I/O usage and resource usage is probably higher than normal, though we did not ran any backups but we did started a remote rsync restoration to a VM , this caused high I/O and other resource usage which led to the disaster. As above stated, this should be taken very seriously as no matter how many replicas are in your ceph pools/storage or no matter if your ceph health is ok, data is lost. Not to mention the downtime even if other backups are saved.. until you restore.. anyway you take it data is and can be lost.

d1_sen · Mar 22, 2022

I have backups running on PBS, however restoring them was of no use either. We're running over 150 VMs in the proxmox cloud, and using PBS for backups, this has become a real concern now as we do not have any other backups running. I wonder what proxmox is doing with regards to this, the least they can do is provide support and guidance.

RolandK · Nov 24, 2022

are you still affected of this bug @Daniel S. @d1_sen ?

if yes, could you please post all details of your system(s) and your affected VM(s) (VM configuration...) ?

Search

Search

Urgent/important issue regarding proxmox/ceph storage and kvm virtualization!

Daniel S.

Active Member

t.lamprecht

Proxmox Staff Member

Daniel S.

Active Member

t.lamprecht

Proxmox Staff Member

Daniel S.

Active Member

RolandK

Famous Member

Daniel S.

Active Member

d1_sen

Member

Daniel S.

Active Member

d1_sen

Member

RolandK

Famous Member

We value your privacy