Win 2019 VM suddenly dies with empty disk (Proxmox 6.0 + Ceph)

syfy323

Nov 16, 2019
Hi!

I am observing a strange issue on one installation (3-node Proxmox cluster + external Ceph cluster).
Ceph is working fine, as are all other VMs. One VM, a Windows Server 2019 guest, crashes every night and then loops with "No bootable device".
When I boot recovery tools, I can see the disk, but it is 100% unallocated - no partitions.

The server itself is clean, no cryptolocker or anything like that, as it has no connection to any public network.

A backup job (application-level, not Proxmox) is running when this happens, which leads me to suspect that some writes end up in the wrong place, rendering the disk useless.
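For reference, a rough way to check which RBD image the affected disk maps to and whether it still holds any data (VM ID 100 and pool name "ceph-vm" below are placeholders, adjust to your setup):

# On the Proxmox node: list the VM's disks and their storage/image names
qm config 100 | grep -E '^(scsi|virtio|sata|ide)'
# On a Ceph admin node: inspect the RBD image backing that disk
rbd info ceph-vm/vm-100-disk-0
rbd du ceph-vm/vm-100-disk-0    # shows how much data the image actually contains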

Has anybody observed a similar problem?

Kind regards
Kevin
 
Which pveversion -v are you on?
 
Which pveversion -v are you on?

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-7
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-3
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
The Ceph cluster itself is also on 14.2.4, and it uses a WAL on an NVMe device.
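For completeness, the release running on each daemon and an OSD's device layout can be checked with (osd.0 is just an example ID):

ceph versions        # which release every mon/mgr/osd reports
ceph osd metadata 0  # backing devices and BlueStore WAL/DB layout of one OSD
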
Do you see the same assert message in the Ceph logs? Also, Ceph 14.2.4.1 is available in our repository; it includes the fix for the possible BlueStore corruption.
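A quick way to look for such an assert on the Ceph nodes (assuming the default log location):

grep -i assert /var/log/ceph/ceph-osd.*.log
ceph crash ls    # Nautilus also records recent daemon crashes here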
 
Do you see the same assert message in the Ceph logs? Also, Ceph 14.2.4.1 is available in our repository; it includes the fix for the possible BlueStore corruption.

The log level on that node is too low, but I identified a disk that might be failing.
The Ceph cluster runs on CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
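(For anyone hitting the same issue: the BlueStore log level can be raised temporarily on a single OSD, and the suspect drive checked, roughly like this; osd.3 and /dev/sdX are placeholders.)

ceph tell osd.3 injectargs '--debug-bluestore 10/10'   # raise BlueStore debug logging at runtime
smartctl -a /dev/sdX                                   # SMART health of the suspect drive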
 
The Ceph cluster runs on CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
Well, Ceph 14.2.5 has been released. ;)
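A rough outline of such a point upgrade on a CentOS 7 cluster using the upstream nautilus repo (adjust to your own procedure; monitors before OSDs):

ceph osd set noout                  # avoid rebalancing while daemons restart
yum update -y ceph                  # pulls 14.2.5 once the repo carries it
systemctl restart ceph-mon.target   # on the monitor nodes first
systemctl restart ceph-osd.target   # then on each OSD node, one at a time
ceph osd unset noout
ceph versions                       # confirm all daemons report 14.2.5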
 
