Crash with data loss in a two-node cluster

MH_MUC

Hi everyone,

I have a problem, and I hope someone knows how to prevent it in the future.

I am running a two-node cluster with ZFS replication and HA. Both nodes run:
Kernel Linux 5.15.131-2-pve #1 SMP PVE 5.15.131-3
pve-manager/7.4-17

Both servers have local-zfs (rpool) and a larger data pool.
Last night server1 became unreachable, so server2 sent me the fencing messages. However, server2 did not bring the VMs up and was itself unreachable via SSH.

I SSHed into server1, which was reachable again, but it showed no free disk space on /.
I was a bit in panic mode, so I didn't check the actual ZFS status. I just deleted some template files to free some disk space and rebooted the machine.
It was then up and running again, showing several hundred GB of free space in both pools.
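For reference, next time I would check where the space actually went before rebooting. Something like the following (standard ZFS tooling; `data-pool` is just the name of my second pool):

```shell
# Overall pool capacity and health
zpool list
zpool status -v

# Per-dataset breakdown: the USEDSNAP column shows space pinned by
# snapshots (e.g. replication snapshots that could not be pruned)
zfs list -o space -r rpool data-pool
```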

After a hard reboot of server2, it was reachable again and brought up all VMs (including the ones previously hosted on server1, which have no failback option). This would have been the expected behaviour during fencing.

However, major data loss occurred. Everything between the evening of the 9th and last night has gone missing, and it is also not included in the Proxmox backups on the PBS.

Last log message in the VMs' syslogs:
Code:
Jan  9 21:00:00 hostname qemu-ga: info: guest-ping called
Jan  9 21:00:01 hostname qemu-ga: info: guest-fsfreeze called

My theory: the freeze of the guest filesystem for the 22:00 local-time backup on the 9th never got released, which ended up filling the ZFS pool, which in turn crashed the server because no free disk space was left; the freeze was then released automatically after the reboot, with data loss.
On the other hand, I can see the ZFS usage starting to increase two hours before yesterday's crash.
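If this happens again, I suppose the freeze state could be confirmed and cleared through the guest agent. Something like this, where 100 is a placeholder VMID:

```shell
# Ask the guest agent whether the guest filesystems are still frozen;
# in normal operation this reports "thawed"
qm guest cmd 100 fsfreeze-status

# If it reports "frozen", thaw manually
qm guest cmd 100 fsfreeze-thaw
```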

Is there a way to confirm that and how can I make sure this doesn't happen again?
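For prevention, one idea I had is reserving a slice of the pool so it can never fill to 100%. A sketch (the dataset name `rpool/reserved` and the 10G figure are my own guesses; please correct me if there is a better approach):

```shell
# Create an empty dataset whose reservation keeps ~10G unusable by
# everything else; lowering the reservation (or destroying the dataset)
# frees emergency space during an incident
zfs create rpool/reserved
zfs set reservation=10G rpool/reserved
```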

Thank you for any help!