[SOLVED] VM filesystems all broken after cluster node crashed

Hi everyone,

We are running a Proxmox VE cluster (version 8.2.7) that is connected to a separate Ceph cluster running Reef (18.2.4).

Over the last few years it has now happened two or three times that a node restarted with a hard shutdown/reset. We weren't able to find out what triggered it.

The node came back up without any problems, but all VMs (with and without HA) hosted on that node had filesystems that were corrupted beyond repair. All of them had to be restored from backups.

I'm now wondering whether I can configure Proxmox and the VMs in a way that prevents this. We switched the disk cache mode to Direct sync after the last incident, but that did not help this time.

Does anyone have experience with this and can suggest something?
 
Hello,
I have noticed that when Proxmox is running as a single node, it restarts itself, but when the node is part of a cluster this does not happen.
This is very strange. It happens to me too and I don't know how to solve it, because it only occurs very sporadically.
 
I'll rephrase my question: which cache settings for Proxmox (librbd) and Ceph will prevent this?
Is there a way to completely disable I/O caching for Proxmox VMs that have their disks on Ceph storage?
 
You should be able to configure this in the disk settings of the VM. I don't know whether you can set a node- or cluster-wide default.
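For example, something like this should do it (VM ID 101 and the storage name ceph-rbd are just placeholders, use your own values and re-specify any other options the disk already had, e.g. discard=on):

Code:
# set the cache mode on an existing disk; the volume itself stays the same
qm set 101 --scsi0 ceph-rbd:vm-101-disk-0,cache=none

# the resulting line in /etc/pve/qemu-server/101.conf then looks like:
# scsi0: ceph-rbd:vm-101-disk-0,cache=none,size=32G

As far as I understand, QEMU passes the cache mode on to librbd, so cache=none should also turn off the librbd write cache unless rbd_cache is explicitly forced on in the ceph.conf used on the PVE side.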
 
Thanks for your reply.

I know these settings, but I'm not sure what they really do in combination with an external Ceph cluster.

We already use Direct sync and it did not prevent the corrupted filesystems.
Is there another layer involved that would explain this?
 
In theory, Direct sync causes every write, whether sync or async, to be pushed to the storage in sync mode. Ceph by default will commit to at least two OSDs before returning an ACK to the client (PVE in this case). So if both PVE and Ceph are properly configured, this should not happen.

The PVE settings are clear, but what is the configuration of that Ceph cluster? Does this happen if you force a power-off manually (i.e. can you easily reproduce the issue)?
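For example, the output of something like this would already help (poolname being the pool from the PVE storage definition; note that options set in the ceph.conf on the PVE nodes can still override the centralized config):

Code:
# replication settings of the RBD pool (size = number of copies, min_size = copies required to accept I/O)
ceph osd pool get poolname size
ceph osd pool get poolname min_size

# show whether rbd_cache (librbd client-side caching) has been explicitly overridden in the centralized config
ceph config dump | grep rbd_cache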
 
I'm writing this in case others have the same problem.

We found out that our Ceph user needed different permissions/capabilities. We had been running with:

Code:
mon = "allow r" osd = "allow * pool=poolname"

At some point in the past this was recommended to us as the minimum needed.
We had to change this to:

Code:
mgr 'profile rbd'
mon 'profile rbd'
osd 'profile rbd pool=poolname'

It seems that our old capabilities did not include image locking and/or unlocking.
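In case it helps someone else: the capabilities of the existing user can be updated in place with ceph auth caps (client.pve is only a placeholder, use the user name from your PVE RBD storage configuration):

Code:
# switch the existing user to the rbd profiles
ceph auth caps client.pve mgr 'profile rbd' mon 'profile rbd' osd 'profile rbd pool=poolname'

# verify the new capabilities
ceph auth get client.pve

As far as I understand, the rbd profiles also allow the client to blocklist a dead client instance and take over the exclusive lock on an image, which is exactly what is needed after a node dies while holding locks.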