Win 2019 VM suddenly dies with empty disk (Proxmox 6.0 + Ceph)

syfy323
Nov 16, 2019
Hi!

I am observing a strange issue on one installation (3-node Proxmox cluster + external Ceph cluster).
Ceph is working fine, as are all other VMs. One VM, a Windows Server 2019 guest, crashes every night and then loops with "No bootable device".
When I boot recovery tools, I can see the disk, but it is 100% unallocated - no partitions at all.
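
For what it's worth, the disk can also be inspected from the Ceph side, to rule out that the RBD image itself is gone; roughly like this (pool and image names are placeholders for my setup):

# does the RBD image behind the VM disk still exist, and how big is it?
rbd info <pool>/vm-<vmid>-disk-0

# how much data is actually allocated inside the image?
rbd du <pool>/vm-<vmid>-disk-0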

The server itself is clean, no cryptolocker or anything like that, as it has no connection to any public network.

There is a backup job (application-level, not Proxmox) running when this happens, which leads me to suspect that some writes end up in the wrong place, rendering my disk useless.
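
One idea to narrow this down, sketched here with placeholder pool/image names: snapshot the disk before the backup window and diff it the next morning, to see whether the image content really changes overnight.

# before the nightly backup window
rbd snap create <pool>/vm-<vmid>-disk-0@pre-backup

# the next morning: which extents changed since the snapshot?
rbd diff --from-snap pre-backup <pool>/vm-<vmid>-disk-0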

Has anybody observed a similar problem?

Kind regards
Kevin
 
On which pveversion -v are you?
 
On which pveversion -v are you?

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-7
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-3
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
The Ceph cluster itself is also on 14.2.4 and it uses a WAL on NVMe.
Do you see the same assert message in the Ceph logs? There is also Ceph 14.2.4.1 in our repository, which includes the fix for the possible BlueStore corruption.
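
If unsure, something along these lines on the Ceph nodes should show it (log path and naming may differ on your setup):

# look for the BlueStore assert in the OSD logs
grep -i assert /var/log/ceph/ceph-osd.*.log

# and check which version the daemons are actually running
ceph versions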
 
Do you see the same assert message in the Ceph logs? There is also Ceph 14.2.4.1 in our repository, which includes the fix for the possible BlueStore corruption.

The log level on that node is too low, but I have identified a disk that might be failing.
The Ceph cluster is running CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
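
For reference, roughly what I plan to run next to get more useful logs and to check that disk (OSD id and device are placeholders):

# raise the BlueStore/bdev log level for the suspect OSD
ceph config set osd.<id> debug_bluestore 5/20
ceph config set osd.<id> debug_bdev 5/20

# check the physical disk behind that OSD
smartctl -a /dev/sdX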
 
The Ceph cluster is running CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
Well, Ceph 14.2.5 has been released. ;)
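
The usual rolling update applies, sketched here assuming the standard systemd units of the upstream CentOS 7 packages:

# prevent rebalancing while OSDs restart
ceph osd set noout

# on each node, after updating the ceph packages
systemctl restart ceph-osd.target

# once all daemons are back up
ceph osd unset noout
ceph versions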