Upgrade from 7 to 8 killed cluster

ropeguru

So I upgraded from 7 to 8 following the directions and it has completely killed my cluster.

Basic setup with 4 nodes, NFS-backed storage for everything (no Ceph), a Linux bridge and two VLANs on the bridge. When the nodes start up they seem to start OK, but when I log into each one and look at the cluster info, every node shows a different status. Syslog is constantly throwing link up/down for each host, and hosts are randomly responsive and unresponsive. I have confirmed that network connectivity is good: everything pings, and while the cluster is going haywire about the network, I can go through each node and reach the shell of every other node in the cluster. At times I find the corosync service pegged at 99% CPU, and when that happens the web GUI is unresponsive. I am also seeing pveproxy dying a lot.

I have been troubleshooting this for more than a day.

pvecm status from each machine:

Code:
root@pve-3060-1:~# pvecm status
Cluster information
-------------------
Name:             PM-MECH
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Aug  5 16:44:00 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.217c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1  NA,NV,NMW 192.168.1.69 (local)
0x00000002          1  NA,NV,NMW 192.168.1.70
0x00000003          1  NA,NV,NMW 192.168.1.71
0x00000004          1  NA,NV,NMW 192.168.1.72
0x00000000          0            Qdevice (votes 0)

root@pve-3060-2:~# pvecm status
Cluster information
-------------------
Name:             PM-MECH
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Aug  5 16:45:13 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.217c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1  NA,NV,NMW 192.168.1.69
0x00000002          1  NA,NV,NMW 192.168.1.70 (local)
0x00000003          1  NA,NV,NMW 192.168.1.71
0x00000004          1  NA,NV,NMW 192.168.1.72
0x00000000          0            Qdevice (votes 0)

root@pve-3060-3:~# pvecm status
Cluster information
-------------------
Name:             PM-MECH
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Aug  5 16:45:44 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000003
Ring ID:          1.217c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1  NA,NV,NMW 192.168.1.69
0x00000002          1  NA,NV,NMW 192.168.1.70
0x00000003          1  NA,NV,NMW 192.168.1.71 (local)
0x00000004          1  NA,NV,NMW 192.168.1.72
0x00000000          0            Qdevice (votes 0)

root@pve-3060-4:~# pvecm status
Cluster information
-------------------
Name:             PM-MECH
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Aug  5 16:46:39 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1.217c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1  NA,NV,NMW 192.168.1.69
0x00000002          1  NA,NV,NMW 192.168.1.70
0x00000003          1  NA,NV,NMW 192.168.1.71
0x00000004          1  NA,NV,NMW 192.168.1.72 (local)
0x00000000          0            Qdevice (votes 0)

I have also verified that corosync.conf is the same on every machine, in both the corosync and pve folders (/etc/corosync and /etc/pve).
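
For anyone wanting to repeat that comparison without eyeballing the files, here is a minimal sketch that checks whether a set of files all have the same checksum. The function name `configs_match` is hypothetical, and it assumes the files are readable locally (for example both copies on one node, or copies fetched from the other nodes beforehand).

```shell
#!/bin/sh
# Return 0 if all files given as arguments have identical checksums,
# 1 if they differ, 2 if no files were given.
# configs_match is a hypothetical helper name used for illustration.
configs_match() {
    [ "$#" -gt 0 ] || return 2
    # cksum prints "<crc> <size> <name>"; keep crc and size only,
    # then count how many distinct pairs there are.
    distinct=$(cksum "$@" | awk '{ print $1, $2 }' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}
```

On a single node this could be called as `configs_match /etc/pve/corosync.conf /etc/corosync/corosync.conf && echo same`.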

Code:
root@pve-3060-4:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-3060-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.69
  }
  node {
    name: pve-3060-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.70
  }
  node {
    name: pve-3060-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.71
  }
  node {
    name: pve-3060-4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.72
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PM-MECH
  config_version: 11
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
 
Hosts file:

Code:
root@pve-3060-4:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.69 pve-3060-1.saroza-webb.internal pve-3060-1
192.168.1.70 pve-3060-2.saroza-webb.internal pve-3060-2
192.168.1.71 pve-3060-3.saroza-webb.internal pve-3060-3
192.168.1.72 pve-3060-4.saroza-webb.internal pve-3060-4

Snippet of syslog:

Code:
2023-08-05T16:45:00.914095-04:00 pve-3060-4 corosync[946]:   [KNET  ] rx: host: 1 link: 0 is up
2023-08-05T16:45:00.914180-04:00 pve-3060-4 corosync[946]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
2023-08-05T16:45:00.914230-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2023-08-05T16:45:01.023352-04:00 pve-3060-4 corosync[946]:   [KNET  ] pmtud: Global data MTU changed to: 8885
2023-08-05T16:45:01.742137-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 40
2023-08-05T16:45:02.549542-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Token has not been received in 3225 ms
2023-08-05T16:45:02.743046-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 50
2023-08-05T16:45:03.440796-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 77 55 57 24 26 2a 48 52 72 73 76 1 2 11 4d 51 6f
2023-08-05T16:45:03.743903-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 60
2023-08-05T16:45:04.744624-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 70
2023-08-05T16:45:05.326556-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 24 26 2a 33 3f 40 48 52 55 57 67 72 73 76 77 1 2 11 51 6f
2023-08-05T16:45:05.745463-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 80
2023-08-05T16:45:06.746275-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 90
2023-08-05T16:45:07.402532-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 26 2a 48 52 55 73 76 1 2 11 6f 77
2023-08-05T16:45:07.747111-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 100
2023-08-05T16:45:07.747196-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retried 100 times
2023-08-05T16:45:07.747238-04:00 pve-3060-4 pmxcfs[843]: [status] crit: cpg_send_message failed: 6
2023-08-05T16:45:08.748507-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 10
2023-08-05T16:45:09.265435-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 26 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:09.749277-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 20
2023-08-05T16:45:10.750123-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 30
2023-08-05T16:45:11.226702-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:11.750871-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 40
2023-08-05T16:45:12.751715-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 50
2023-08-05T16:45:13.226703-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:13.752454-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 60
2023-08-05T16:45:14.753311-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 70
2023-08-05T16:45:15.333248-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:15.754052-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 80
2023-08-05T16:45:16.313984-04:00 pve-3060-4 corosync[946]:   [KNET  ] link: host: 2 link: 0 is down
2023-08-05T16:45:16.314160-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
2023-08-05T16:45:16.314223-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 2 has no active links
2023-08-05T16:45:16.714267-04:00 pve-3060-4 corosync[946]:   [KNET  ] link: host: 1 link: 0 is down
2023-08-05T16:45:16.714331-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2023-08-05T16:45:16.714379-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 1 has no active links
2023-08-05T16:45:16.754677-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 90
2023-08-05T16:45:17.755614-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 100
2023-08-05T16:45:17.755775-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retried 100 times
2023-08-05T16:45:17.755851-04:00 pve-3060-4 pmxcfs[843]: [status] crit: cpg_send_message failed: 6
2023-08-05T16:45:17.820645-04:00 pve-3060-4 pve-firewall[968]: firewall update time (20.017 seconds)
2023-08-05T16:45:18.558357-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Token has not been received in 3225 ms
2023-08-05T16:45:18.756102-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 10
2023-08-05T16:45:18.916262-04:00 pve-3060-4 corosync[946]:   [KNET  ] rx: host: 2 link: 0 is up
2023-08-05T16:45:18.916347-04:00 pve-3060-4 corosync[946]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
2023-08-05T16:45:18.916392-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
2023-08-05T16:45:19.029777-04:00 pve-3060-4 corosync[946]:   [KNET  ] pmtud: Global data MTU changed to: 8885
2023-08-05T16:45:19.226464-04:00 pve-3060-4 corosync[946]:   [KNET  ] rx: host: 1 link: 0 is up
2023-08-05T16:45:19.226555-04:00 pve-3060-4 corosync[946]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
2023-08-05T16:45:19.226599-04:00 pve-3060-4 corosync[946]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2023-08-05T16:45:19.237096-04:00 pve-3060-4 corosync[946]:   [KNET  ] pmtud: Global data MTU changed to: 8885
2023-08-05T16:45:19.331104-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:19.756733-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 20
2023-08-05T16:45:20.757333-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 30
2023-08-05T16:45:21.226700-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:21.758062-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 40
2023-08-05T16:45:22.758727-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 50
2023-08-05T16:45:23.226632-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:23.759348-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 60
2023-08-05T16:45:24.759987-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 70
2023-08-05T16:45:25.229672-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:25.760615-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 80
2023-08-05T16:45:26.761201-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 90
2023-08-05T16:45:27.762082-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 100
2023-08-05T16:45:27.762176-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retried 100 times
2023-08-05T16:45:27.762227-04:00 pve-3060-4 pmxcfs[843]: [status] crit: cpg_send_message failed: 6
2023-08-05T16:45:28.226718-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:28.763211-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 10
2023-08-05T16:45:29.764062-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 20
2023-08-05T16:45:30.334576-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:30.764941-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 30
2023-08-05T16:45:31.765756-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 40
2023-08-05T16:45:32.330845-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:32.766515-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 50
2023-08-05T16:45:33.767364-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 60
2023-08-05T16:45:34.403883-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:34.768171-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 70
2023-08-05T16:45:35.769103-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 80
2023-08-05T16:45:36.330752-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:36.770225-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 90
2023-08-05T16:45:37.770825-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 100
2023-08-05T16:45:37.770915-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retried 100 times
2023-08-05T16:45:37.770959-04:00 pve-3060-4 pmxcfs[843]: [status] crit: cpg_send_message failed: 6
2023-08-05T16:45:37.853499-04:00 pve-3060-4 pve-firewall[968]: firewall update time (20.033 seconds)
2023-08-05T16:45:38.334068-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:39.279262-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 10
2023-08-05T16:45:40.226694-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:40.280045-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 20
2023-08-05T16:45:41.280870-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 30
2023-08-05T16:45:42.281749-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 40
2023-08-05T16:45:42.330615-04:00 pve-3060-4 corosync[946]:   [TOTEM ] Retransmit List: 2a 48 52 55 73 76 1 2 11 77
2023-08-05T16:45:43.282544-04:00 pve-3060-4 pmxcfs[843]: [status] notice: cpg_send_message retry 50
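
With this much noise it can help to summarize how often each peer's link is reported down rather than scrolling through the log. The sketch below counts the KNET "link ... is down" lines per host id; it assumes the message format shown in the snippet above, and `count_link_down` is a made-up name.

```shell
#!/bin/sh
# Count KNET "link: host: N link: M is down" messages per host id.
# Reads the files given as arguments, or stdin if none are given.
# Prints "<host id> <count>" lines sorted by host id.
# count_link_down is a hypothetical helper name.
count_link_down() {
    awk '
        /\[KNET *\] link: host: [0-9]+ link: [0-9]+ is down/ {
            # The line ends with: host: <id> link: <n> is down
            # so the host id is the 5th field from the end.
            down[$(NF-4)]++
        }
        END { for (h in down) print h, down[h] }
    ' "$@" | sort -n
}
```

It could be fed from a log file or from the journal, for example `journalctl -u corosync | count_link_down`, assuming corosync logs to the journal on that node.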
 
Well, I solved it on my own.

It ended up being an issue with a Qdevice I had configured. Not sure if it was something during the upgrade or what, but I had to restart the small Linux container running the daemon and reboot each node, and all is good again.