Hello,
I am opening a forum discussion because I am experiencing an issue with the Proxmox cluster in my environment.
I installed Proxmox VE 9.1.6 on each node and everything went very smoothly, right up until the cluster was created.
I'll provide all the relevant information about my environment, followed by service logs, a record of the actions taken, and so on.
Environment
- Proxmox VE version: 9.1.6
- Kernel version: 6.17.13-2-pve
- Proxmox VE cluster: 3 nodes
- Transport: knet
- Nodes:
- pve1: 10.100.37.250
- pve2: 10.100.37.251
- pve3: 10.100.37.252
The three Proxmox nodes are installed on Dell PowerEdge XR4510c servers with the following dedicated network ports:
- NIC 1 (10 Gbps Ethernet connection) --> Dedicated to the VM network
- NIC 2 (10 Gbps Ethernet connection) --> Dedicated to the host and cluster network (web access on port 8006, SSH access, and corosync communication for the PVE cluster) [the 10.100.37.0/27 network]
- NIC 3 & NIC 4 (10 Gbps optical connection) --> Dedicated to the Ceph cluster, which we have not installed yet.
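For reference, a minimal /etc/network/interfaces sketch of this layout (the interface names eno1/eno2 and the bridge name vmbr0 are assumptions for illustration, not our actual names):

```
auto lo
iface lo inet loopback

# NIC 1: VM traffic, bridged for guests (eno1 is a placeholder name)
auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet manual
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

# NIC 2: host + cluster network (address shown is pve1's)
auto eno2
iface eno2 inet static
        address 10.100.37.250/27
```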
Actions taken
Here are the actions we have taken:
- Installed and properly configured Proxmox VE on each node --> OK
- Added the Basic subscription on each node --> OK
- Updated each node using the Enterprise repository --> OK
- Configured the network interfaces --> OK
We restarted the servers multiple times; everything was OK and each node was stable.
Then we created the cluster through the UI of the first node [pve1] in "Datacenter --> Cluster --> Create Cluster --> Select the 10.100.37.0/27 interface", copied the join information, and joined the cluster from the two other nodes.
From that point on, the nodes became unstable, including issues with UI access (complete loss of access through the web; only SSH still works) and intermittent problems with cluster functionality.
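For completeness, a hedged sketch of the CLI equivalent of what we did through the UI (the cluster name "demo" is made up; each joining node passes its own address to --link0):

```shell
# On pve1: create the cluster, binding corosync link0 to the cluster NIC
pvecm create demo --link0 10.100.37.250

# On pve2 (and similarly pve3 with 10.100.37.252): join via pve1
pvecm add 10.100.37.250 --link0 10.100.37.251

# Verify quorum and membership afterwards
pvecm status
```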
Logs of services and some diagnostics
In the logs we see retransmit list messages:

Apr 09 14:44:04 pve2-cdcserris corosync[4045]: [TOTEM ] Retransmit List: 1f 20 22 23 24 25 26 27 28 33
Apr 09 14:43:59 pve3-cdcserris corosync[3860]: [TOTEM ] Retransmit List: 28 1d
Apr 09 14:35:19 pve1-cdcserris corosync[3405]: [TOTEM ] Retransmit List: 15 16 17 19
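Retransmit lists usually indicate packet loss or latency on the corosync link, so we plan to run a few checks on each node (the ping size of 1472 assumes a standard 1500-byte MTU; adjust if jumbo frames are in use):

```shell
# Show corosync link status and ring addresses as seen from this node
corosync-cfgtool -s

# Show quorum status
corosync-quorumtool -s

# Test for loss/fragmentation between nodes (from pve1 toward pve2);
# -M do forbids fragmentation, -s 1472 fills a 1500-byte MTU frame
ping -M do -s 1472 -c 50 10.100.37.251
```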
It seems the network issues put Proxmox services into the D (uninterruptible sleep) state:
Apr 09 14:31:14 pve2-cdcserris kernel: task:pvestatd state:D stack:0 pid:1296 tgid:1296 ppid:1 flags:0x00004002
Apr 09 14:41:24 pve1-cdcserris kernel: task:pvescheduler state:D stack:0 pid:4380 tgid:4380 ppid:1349 flags:0x00004002
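To confirm which services are stuck, the blocked tasks can be listed directly (this is a generic Linux check, not Proxmox-specific; pvestatd and pvescheduler typically appear here when /etc/pve stops responding):

```shell
# List processes in uninterruptible sleep (STAT starting with "D"),
# keeping the ps header row; the wchan column hints at where they block
ps -eo pid,stat,comm,wchan | awk 'NR==1 || $2 ~ /^D/'
```

D-state processes cannot be killed; they only clear once the underlying I/O (here, presumably the blocked cluster filesystem) completes or the node is rebooted.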