[SOLVED] 6.8 kernel, vzdump to local storage results in node fenced

RobFantini

Hello,
we have a 5-node cluster.

A couple of months ago we had this issue; to work around it I pinned a 6.5 kernel.

Last night I unpinned it and rebooted the 5 nodes to use 6.8.4-2-pve.

At 2 AM, shortly after a vzdump backup to this storage:
Code:
dir: z-local-nvme
        path /nvme-ext4
        content images,backup,vztmpl,iso,snippets,rootdir
        prune-backups keep-last=1
        shared 0
one of the nodes got fenced.
HA VMs were migrated.

All 5 nodes run the backup at the same time.

I pinned 6.5.13-5-pve and rebooted all nodes.
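
For reference, the pinning was done with proxmox-boot-tool (assuming the bootloader is managed by it, as on a standard install):
Code:
# show kernels known to the bootloader
proxmox-boot-tool kernel list
# pin a specific kernel as the default boot entry, then reboot
proxmox-boot-tool kernel pin 6.5.13-5-pve
# later, to go back to booting the newest installed kernel
proxmox-boot-tool kernel unpin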
--------------------------------

So it looks like there is a bug or a bad configuration here.

Any suggestions to fix this issue?
Is more data needed?
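
One thing we could try, if the simultaneous backup I/O turns out to be the trigger, is throttling vzdump on each node. A rough sketch of /etc/vzdump.conf settings (example values, not tested on this cluster):
Code:
# /etc/vzdump.conf -- node-wide vzdump defaults (example values)
# limit backup I/O bandwidth, in KiB/s (here ~100 MiB/s)
bwlimit: 102400
# lower the backup worker's I/O priority (0-8, default 7; applies with the BFQ scheduler)
ionice: 8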
 
Hi,

RobFantini said:
one of the nodes got fenced.

When you say it got fenced, do you mean it was rebooted, or did it lose connection to the quorate network segment? This sounds more like a node-local issue to me: the backup is performed to a local disk, so the network should be fine.

RobFantini said:
Any suggestions to fix this issue?
Is more data needed?

Can you provide an excerpt of the systemd journal of all the nodes in the cluster around the time the issue appeared?
Code:
journalctl --since <DATETIME> --until <DATETIME> > $(hostname)-systemd-journal.log
This should give us more information.
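
If the full journal is too large to share, filtering to the cluster, HA, and watchdog services usually captures the relevant events; a sketch (unit names as on a standard Proxmox VE install, same <DATETIME> placeholders as above):
Code:
journalctl --since <DATETIME> --until <DATETIME> \
    -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
    > $(hostname)-cluster-journal.log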
 

Hi Chris,
the node got disconnected from the cluster network.

So there could be faulty hardware, where the local write to the NVMe disk triggers a network connectivity issue.
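
To check for that, I'll look at the NIC error counters and kernel messages on the affected node; something like this (eno1 is just an example interface name):
Code:
# error/drop counters on the cluster NIC
ip -s link show eno1
# driver-level statistics, if the driver supports it
ethtool -S eno1 | grep -iE 'err|drop|discard'
# kernel messages mentioning the NIC driver or NVMe during the backup window
dmesg -T | grep -iE 'eno1|nvme'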

I will try using the 6.8 kernel on another node and keep 6.5 on the rest.
 
It looks like the issue was caused by faulty hardware on one node.

I unpinned the older kernel from the other 4 nodes two weeks ago, and the issue has not occurred since.
 
