[SOLVED] 6.8 kernel, vzdump to local storage results in node getting fenced

RobFantini

Hello,
we have a 5-node cluster.

A couple of months ago we had this issue; to work around it I pinned a 6.5 kernel.

Last night I unpinned the kernel and rebooted the 5 nodes so they run 6.8.4-2-pve.

At 2 AM, shortly after a vzdump backup to this storage:
Code:
dir: z-local-nvme
        path /nvme-ext4
        content images,backup,vztmpl,iso,snippets,rootdir
        prune-backups keep-last=1
        shared 0
one of the nodes got fenced.
HA VMs were migrated.

All 5 nodes run the backup at the same time.

I pinned 6.5.13-5-pve again and rebooted all nodes.
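
For reference, pinning and unpinning on each node was done roughly like this (a sketch using proxmox-boot-tool; the version string is the one pinned here, check the output of "proxmox-boot-tool kernel list" for what is installed on your node):
Code:
# list installed kernels and any current pin
proxmox-boot-tool kernel list

# pin the known-good 6.5 kernel, then reboot so it takes effect
proxmox-boot-tool kernel pin 6.5.13-5-pve
reboot

# later, to go back to the default (newest installed) kernel
proxmox-boot-tool kernel unpin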
--------------------------------

So it looks like there is a bug or a bad configuration here.

Any suggestions to fix this issue?
Is more data needed?
 
Hi,
one of the nodes got fenced.
When you say it got fenced, do you mean it was rebooted, or did it lose connection to the quorate network segment?
This sounds more like a node-local issue to me, as the backup is performed to a local disk, so the network should be fine.

Any suggestions to fix this issue?
Is more data needed?
Can you provide an excerpt of the systemd journal of all the nodes in the cluster from around the time this issue appeared? journalctl --since <DATETIME> --until <DATETIME> > $(hostname)-systemd-journal.log should give us more information.
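For example (the timestamps below are only placeholders; use the window around the fencing event on your cluster):
Code:
# placeholder timestamps, adjust to the night of the backup
journalctl --since "2024-05-09 01:45" --until "2024-05-09 02:30" > $(hostname)-systemd-journal.log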
 
Hi Chris,
the node got disconnected from the cluster network.

So there could be faulty hardware, as the local write to the NVMe disk could be causing some network connection issue.

I will try using a 6.8 kernel on another node and keep 6.5 on the rest.
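
To check whether the heavy local I/O is actually disturbing corosync on that node, a quick look could be something like this (a sketch; the grep pattern is just an example of the kind of messages corosync logs on link trouble, and the timestamps are placeholders):
Code:
# corosync link/ring status as seen from this node
corosync-cfgtool -s

# quorum view of the cluster
pvecm status

# corosync log around the backup window (placeholder timestamps)
journalctl -u corosync --since "01:45" --until "02:30" | grep -iE "token|link|knet"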
 
It looks like the issue was caused by a hardware problem on one node.

I unpinned the older kernel on the other 4 nodes two weeks ago, and the issue has not recurred.