Win 2019 VM suddenly dies with empty disk (Proxmox 6.0 + Ceph)

syfy323

Nov 16, 2019
Hi!

I am observing a strange issue on one installation (3-node Proxmox cluster + external Ceph cluster).
Ceph is working fine, as are all other VMs. One VM, a Windows Server 2019 guest, crashes every night and then loops with "No bootable device".
When I boot recovery tools, I can see the disk, but it is 100% unallocated - no partitions.

The server itself is clean, no cryptolocker or anything like that, as it has no connection to any public network.

A backup job (application-level, not Proxmox) is running when this happens, which leads me to suspect that some writes end up in the wrong place, rendering the disk useless.
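For reference, a rough way to check which RBD image the affected disk maps to and whether it still holds any data (VM ID 100 and pool name "ceph-vm" below are placeholders, adjust to your setup):

# On the Proxmox node: list the VM's disks and their storage/image names
qm config 100 | grep -E '^(scsi|virtio|sata|ide)'
# On a Ceph admin node: inspect the RBD image backing that disk
rbd info ceph-vm/vm-100-disk-0
rbd du ceph-vm/vm-100-disk-0    # shows how much data the image actually contains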

Has anybody observed a similar problem?

Kind regards
Kevin
 
Which pveversion -v are you on?
 
Which pveversion -v are you on?

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-7
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-3
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
The Ceph cluster itself is also on 14.2.4, and it uses a WAL on an NVMe device.
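For completeness, the release running on each daemon and an OSD's device layout can be checked with (osd.0 is just an example ID):

ceph versions        # which release every mon/mgr/osd reports
ceph osd metadata 0  # backing devices and BlueStore WAL/DB layout of one OSD
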
Do you see the same assert message in the Ceph logs? Also, Ceph 14.2.4.1 is available in our repository; it includes the fix for the possible BlueStore corruption.
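A quick way to look for such an assert on the Ceph nodes (assuming the default log location):

grep -i assert /var/log/ceph/ceph-osd.*.log
ceph crash ls    # Nautilus also records recent daemon crashes here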
 
Do you see the same assert message in the Ceph logs? Also, Ceph 14.2.4.1 is available in our repository; it includes the fix for the possible BlueStore corruption.

The log level on that node is too low, but I identified a disk that might be failing.
The Ceph cluster runs on CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
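(For anyone hitting the same issue: the BlueStore log level can be raised temporarily on a single OSD, and the suspect drive checked, roughly like this; osd.3 and /dev/sdX are placeholders.)

ceph tell osd.3 injectargs '--debug-bluestore 10/10'   # raise BlueStore debug logging at runtime
smartctl -a /dev/sdX                                   # SMART health of the suspect drive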
 
The Ceph cluster runs on CentOS 7, not Proxmox, and it uses the official upstream build, which lacks the patch.
Well, Ceph 14.2.5 has been released. ;)
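A rough outline of such a point upgrade on a CentOS 7 cluster using the upstream nautilus repo (adjust to your own procedure; monitors before OSDs):

ceph osd set noout                  # avoid rebalancing while daemons restart
yum update -y ceph                  # pulls 14.2.5 once the repo carries it
systemctl restart ceph-mon.target   # on the monitor nodes first
systemctl restart ceph-osd.target   # then on each OSD node, one at a time
ceph osd unset noout
ceph versions                       # confirm all daemons report 14.2.5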
 
