VM filesystems all broken after cluster node crashed

woodstock

Renowned Member
Feb 18, 2016
Hi everyone,

We are running a Proxmox VE cluster (version 8.2.7) that is connected to a separate Ceph cluster running Reef (18.2.4).

Two or three times over the last few years, a node has restarted after a hard shutdown/reset. We weren’t able to find out what triggered it.

The node came back up without any problems, but all VMs (with and without HA) hosted on that node had filesystems that were corrupted beyond repair. All VMs had to be restored from backups.

I’m now wondering whether I can configure Proxmox and the VMs in a way that prevents this. We switched to the direct sync cache mode after the previous incident, but it did not help this time.

Does anyone have experience with this and can suggest something?
 
Hello,
I have noticed that a node running Proxmox on its own restarts itself.
But when it is part of a cluster, this does not happen.
This is very strange, but it happens to me too, and I don't know how to solve it because it happens very sporadically.
 
I'll rephrase my question: which cache settings for Proxmox (librbd) and Ceph will prevent this?
Is there a way to completely disable I/O caching for Proxmox VMs whose disks are on Ceph storage?
 
You should be able to configure this in the disk settings of the VM. I don't know whether you can set a node-wide or cluster-wide default.
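To check which cache mode each disk of a VM actually uses, the VM configuration file can be read directly. The following is only a minimal sketch: it assumes the usual layout of `/etc/pve/qemu-server/<vmid>.conf`, where a disk line looks like `scsi0: <storage>:<volume>,cache=directsync,size=32G`, and it assumes that a disk without a `cache=` option runs with the "No cache" default (`none`). The storage name `ceph-vm`, the volume names, and the helper `disk_cache_modes` are made up for illustration.

```python
import re

# Hypothetical sample in the style of /etc/pve/qemu-server/<vmid>.conf.
SAMPLE_CONFIG = """\
boot: order=scsi0
ide2: none,media=cdrom
scsi0: ceph-vm:vm-100-disk-0,cache=directsync,size=32G
scsi1: ceph-vm:vm-100-disk-1,size=100G
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
"""

# Keys that describe a disk on one of the emulated buses.
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide)\d+$")


def disk_cache_modes(config_text, default="none"):
    """Return {disk_key: cache_mode} for every non-CD-ROM disk line.

    `default` is the mode assumed when no cache= option is present
    (an assumption: "No cache" is the default shown in the disk dialog).
    """
    modes = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not DISK_KEY.match(key):
            continue
        # The first comma-separated field is the volume; the rest are options.
        options = dict(
            part.split("=", 1) for part in value.split(",")[1:] if "=" in part
        )
        if options.get("media") == "cdrom":
            continue
        modes[key] = options.get("cache", default)
    return modes


if __name__ == "__main__":
    for disk, mode in disk_cache_modes(SAMPLE_CONFIG).items():
        print(f"{disk}: cache={mode}")
```

Running the same function over every file in `/etc/pve/qemu-server/` on a node would give a quick overview of disks that are not set to the intended mode, since the setting is per disk rather than per node.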
 
Thanks for your reply.

I know these settings, but I'm not sure what they really do in combination with an external Ceph cluster.

We already use direct sync, and this did not prevent the corrupted filesystems.
Is there some other layer involved that could explain this?
 
In theory, direct sync causes every write, whether sync or async, to be pushed to the storage in sync mode. By default, Ceph will commit to at least two OSDs before returning an ACK to the client (PVE in this case). So if both PVE and Ceph are properly configured, this should not happen.
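To make the difference between the cache modes concrete, here is a small sketch of how I understand the QEMU cache modes to map onto the three underlying flags (`cache.writeback`, `cache.direct`, `cache.no-flush`). The flag table follows the QEMU documentation as far as I know it. The librbd part is an assumption: to my understanding, QEMU's rbd driver switches the librbd cache off whenever `cache.direct` is on, regardless of what `ceph.conf` says. The helper names are made up.

```python
from typing import NamedTuple


class CacheMode(NamedTuple):
    writeback: bool  # cache.writeback: a write may be acknowledged before it is stable
    direct: bool     # cache.direct: bypass the host-side cache
    no_flush: bool   # cache.no-flush: flush requests from the guest are ignored


# Flag combinations per cache mode, as described in the QEMU documentation.
QEMU_CACHE_MODES = {
    "writeback":    CacheMode(writeback=True,  direct=False, no_flush=False),
    "none":         CacheMode(writeback=True,  direct=True,  no_flush=False),
    "writethrough": CacheMode(writeback=False, direct=False, no_flush=False),
    "directsync":   CacheMode(writeback=False, direct=True,  no_flush=False),
    "unsafe":       CacheMode(writeback=True,  direct=False, no_flush=True),
}


def safe_without_guest_flush(mode):
    """True if every completed write is meant to be stable even when the
    guest never issues a flush (the 'sync mode' behaviour described above)."""
    flags = QEMU_CACHE_MODES[mode]
    return not flags.writeback and not flags.no_flush


def librbd_cache_enabled(mode):
    """Assumption: the librbd cache is used only when cache.direct is off."""
    return not QEMU_CACHE_MODES[mode].direct


if __name__ == "__main__":
    for name in QEMU_CACHE_MODES:
        print(
            f"{name:13} safe_without_guest_flush={safe_without_guest_flush(name)!s:5} "
            f"librbd_cache={librbd_cache_enabled(name)}"
        )
```

If this mapping is right, `directsync` leaves neither a host-side cache nor a librbd cache in the write path, so a remaining source of corruption would have to be somewhere other than the PVE cache setting.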

The PVE settings are clear; what is the Ceph configuration on that Ceph cluster? Does this happen if you force a power-off manually (i.e., can you easily reproduce the issue)?
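As a starting point for the Ceph side, the replication settings of the pool that holds the VM disks are worth a look. They can be read with `ceph osd pool get <pool> size` and `ceph osd pool get <pool> min_size`. The sketch below only evaluates two numbers you have read that way; the thresholds (3 replicas, `min_size` of 2) are the values commonly recommended for replicated pools, not something known about this particular cluster, and the function name is made up.

```python
def replication_warnings(size, min_size):
    """Return human-readable warnings for a replicated pool's settings.

    size     -- number of replicas the pool keeps
    min_size -- minimum number of replicas that must be available for I/O
    """
    warnings = []
    if min_size > size:
        warnings.append("min_size is larger than size; the pool cannot serve I/O")
    if size < 3:
        warnings.append(
            f"size={size}: fewer than 3 replicas leaves little margin after a failure"
        )
    if min_size < 2:
        warnings.append(
            f"min_size={min_size}: writes are accepted with a single surviving replica"
        )
    return warnings


if __name__ == "__main__":
    # Example values only; substitute what the pool actually reports.
    for size, min_size in [(3, 2), (2, 1)]:
        problems = replication_warnings(size, min_size)
        print(f"size={size} min_size={min_size}: {problems or 'no warnings'}")
```

A pool that passes this check can still be misconfigured in other ways (for example, disk write caches on the OSD hosts), so this narrows the search rather than ruling Ceph out.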
 
