qm list (and other commands) hanging.

Hyien

I have a 3-node cluster, all nodes running
pve-manager/7.4-3/9002ab8a (running kernel: 5.15.102-1-pve)

When I run 'qm list' or other Proxmox commands, the command simply hangs.

pvecm status
Cluster information
-------------------
Name: XXX
Config Version: 13
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Apr 3 03:24:54 2023
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1.4978
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 XXX
0x00000002 1 XXX (local)
0x00000003 1 XXX
 
Hi,
please check and post the content of the journal from around the time the command was executed: journalctl --since <DATE> --until <DATE>.
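For example, to cover roughly an hour around one of the hangs (the timestamps below are placeholders, adjust them to when the command actually hung):

journalctl --since "2023-04-03 03:00:00" --until "2023-04-03 04:00:00"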
 
I see a bunch of these in the journal:
Apr 03 08:35:50 XXX corosync[251358]: [TOTEM ] Retransmit List: 12 13 16 23 24 27 36 3d 3e 3f 40 41 43 4f 50 51 52 53 5b 5d 5e 5f 60
Apr 03 08:35:51 XXX pmxcfs[364280]: [status] notice: cpg_send_message retry 80
Apr 03 08:35:51 XXX corosync[251358]: [TOTEM ] Retransmit List: 12 13 16 23 24 27 36 3d 3e 3f 40 41 43 4f 50 51 52 53 5b 5d 5e 5f 60
Apr 03 08:35:52 XXX corosync[251358]: [TOTEM ] Retransmit List: 12 13 16 23 24 27 36 3d 3e 3f 40 41 43 4f 50 51 52 53 5b 5d 5e 5f 60
Apr 03 08:35:52 XXX pmxcfs[364280]: [status] notice: cpg_send_message retry 90
Apr 03 08:35:53 XXX pmxcfs[364280]: [status] notice: cpg_send_message retry 100
Apr 03 08:35:53 XXX pmxcfs[364280]: [status] notice: cpg_send_message retried 100 times
Apr 03 08:35:53 XXX pmxcfs[364280]: [status] crit: cpg_send_message failed: 6
Apr 03 08:35:54 XXX pmxcfs[364280]: [status] notice: cpg_send_message retry 10
Apr 03 08:35:54 XXX corosync[251358]: [TOTEM ] Token has not been received in 2738 ms
Apr 03 08:35:55 XXX pmxcfs[364280]: [status] notice: cpg_send_message retry 20
 
corosync is running at 100% CPU. What might be causing this?
Please try to restart pmxcfs (it is provided by the pve-cluster service) and see if the problem persists: systemctl restart pve-cluster.service.
Is your network operational?
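One rough way to sanity-check the corosync links (a sketch; <corosync-IP-of-another-node> is a placeholder for another node's address on the corosync network):

corosync-cfgtool -s   # link status as seen by the local corosync
ping -c 100 -i 0.2 <corosync-IP-of-another-node>   # latency and packet loss on the corosync network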
 
Could you please provide your network config (cat /etc/network/interfaces) and the corosync config of all nodes (cat /etc/corosync/corosync.conf)? Is corosync running on a dedicated network, or is it sharing the same network as the other traffic? Did you perform any changes before the problem arose?
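If it is easier to collect those in one go, a loop like this from any one node should work (node names are placeholders; it assumes root SSH between the nodes, which a standard PVE cluster has):

for node in node1 node2 node3; do
    echo "=== $node ==="
    ssh root@$node "cat /etc/network/interfaces; echo; cat /etc/corosync/corosync.conf"
done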

Input from a colleague on what else to try (each step has to be performed on all nodes before proceeding to the next; a consolidated command sketch follows after the list):
  • disable HA services if HA is enabled to prevent fencing
  • stop corosync & pve-cluster by running systemctl stop corosync pve-cluster
  • start corosync systemctl start corosync
  • check the corosync state and post the output of corosync-quorumtool -s and corosync-cfgtool -n
  • start pmxcfs via systemctl start pve-cluster
  • check the logs and the output of pvecm status
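A consolidated sketch of the steps above, run one step at a time on every node before moving to the next step (service names assume the standard Proxmox HA units pve-ha-crm and pve-ha-lrm):

# 1. disable HA to prevent fencing (only if HA is enabled)
systemctl stop pve-ha-crm pve-ha-lrm
# 2. stop corosync and pmxcfs
systemctl stop corosync pve-cluster
# 3. start corosync again and check membership and quorum
systemctl start corosync
corosync-quorumtool -s
corosync-cfgtool -n
# 4. start pmxcfs and verify the cluster state
systemctl start pve-cluster
pvecm status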
 