Hello,
We are running multiple VMs in the following environment: a Proxmox cluster with Ceph block storage; all OSDs are enterprise SSDs, and the RBD pool is replicated 3 times.
Ceph version: 15.2.11
All nodes in the cluster run exactly the following package versions: https://pastebin.com/ugjzptQ9
We installed a Red Hat based OS on one VM and started migrating data to it from another machine via rsync (the machine we are restoring from is outside this cluster).
The VM has 3 VirtIO SCSI disks attached; see the full disk config below.
The VM also has an EFI disk (all disks, including the EFI disk, are on the same Ceph RBD storage) and boots with OVMF (UEFI) as BIOS.
This is from the VM config file:
Code:
efidisk0: rbd:vm-108-disk-1,size=1M
scsi0: rbd:vm-108-disk-0,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=250G - /dev/sda1 partition ext4
scsi1: rbd:vm-108-disk-2,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=500G - /dev/sdb1 partition ext4
scsi2: rbd:vm-108-disk-3,backup=0,cache=writeback,discard=on,iothread=1,queues=8,size=2T - /dev/sdc1 partition ext4
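In case it helps anyone looking into this, the images can also be inspected directly from a Proxmox node, bypassing the VM entirely. A rough sketch (assuming the pool behind the "rbd" storage ID is itself named "rbd"; adjust pool/image names as needed):
Code:
# map one of the affected images read-only on a Proxmox node
# (rbd map prints the device node it created, e.g. /dev/rbd0)
rbd map rbd/vm-108-disk-0 --read-only
# check whether a partition table and filesystem signatures are still there
fdisk -l /dev/rbd0
gdisk -l /dev/rbd0
blkid -p /dev/rbd0
# clean up
rbd unmap /dev/rbd0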
The VM was suddenly killed by the oom-killer. That in itself is not the issue: the node has 256 GB of RAM and runs a few more VMs, and we assigned 192 GB to this specific VM, so we clearly gave it too much memory and probably just need more RAM. But see below what happened afterwards.
Here are the logs from the node that hosted the VM, from the moment it was killed: https://pastebin.com/EUPZa9m7
The very big problem is that the /dev/sda1 and /dev/sdb1 partitions no longer exist after booting the VM; something appears to have wiped or removed them, which is unacceptable. When we boot a live CD, all the disks are there, but /dev/sda and /dev/sdb no longer contain any partitions; the only one that still exists and can be mounted is /dev/sdc1. The kind of checks we run from the live CD are sketched below.
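A rough sketch of those live CD checks (exact output omitted; device names are as seen inside the VM):
Code:
# disks are present, but /dev/sda and /dev/sdb show no partitions
lsblk
fdisk -l /dev/sda /dev/sdb /dev/sdc
# probe the raw disks for leftover partition-table/filesystem signatures
blkid -p /dev/sda /dev/sdb
# dump the first sectors to see whether the GPT/MBR area has been zeroed
dd if=/dev/sda bs=512 count=34 2>/dev/null | hexdump -C | head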
Do you have any idea about this? What could cause this behaviour?
Nothing happened on the Ceph side: nothing suspicious in the logs, no failed OSDs or PGs, and health was and is OK the whole time. The checks we ran are sketched below.
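Roughly what we checked on the cluster (a sketch; the pool name "rbd" is assumed from the storage ID above):
Code:
# cluster-wide status and health
ceph -s
ceph health detail
ceph osd tree
# the affected images still exist with the expected sizes
rbd -p rbd info vm-108-disk-0
rbd -p rbd info vm-108-disk-2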
This is a very weird situation: we have been running these environments for years, with many different OSes on the VMs, and have never encountered such an issue. This has to be investigated.
If someone has a hint/clue/idea please let me know.
Thank you.