pve6to7 : corosync.conf (5) and pmxcfs (6) don't agree about size of nodelist

PointPubMedia

Hi,

We are planning to upgrade from 6 to 7, but on 1 out of 5 nodes we get this:

Analzying quorum settings and state..
FAIL: 1 nodes are offline!
INFO: configured votes - nodes: 5
INFO: configured votes - qdevice: 0
INFO: current expected votes: 5
INFO: current total votes: 5
FAIL: corosync.conf (5) and pmxcfs (6) don't agree about size of nodelist.
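
The counts in the last line are the actual disagreement: corosync.conf lists 5 nodes, while pmxcfs still tracks 6. A minimal sketch for reproducing the two counts by hand (assuming the default file locations; each node entry has exactly one ring0_addr and one "id"):
Code:
grep -c 'ring0_addr' /etc/pve/corosync.conf   # nodes in corosync.conf
grep -o '"id"' /etc/pve/.members | wc -l      # nodes tracked by pmxcfs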
 
Hi,
is /etc/corosync/corosync.conf the same as /etc/pve/corosync.conf on that node? What is the output of cat /etc/pve/corosync.conf and pvecm status?
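
(A quick way to check is a plain diff; no output means the two files are identical:)
Code:
diff /etc/corosync/corosync.conf /etc/pve/corosync.conf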
 
/etc/corosync/corosync.conf and /etc/pve/corosync.conf are the same!

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve11
    nodeid: 2
    quorum_votes: 1
    ring0_addr: x.x.x.242
    ring1_addr: x.x.y.242
  }
  node {
    name: pve12
    nodeid: 4
    quorum_votes: 1
    ring0_addr: x.x.x.243
    ring1_addr: x.x.y.243
  }
  node {
    name: pve13
    nodeid: 5
    quorum_votes: 1
    ring0_addr: x.x.x.244
    ring1_addr: x.x.y.244
  }
  node {
    name: pve14
    nodeid: 6
    quorum_votes: 1
    ring0_addr: x.x.x.245
    ring1_addr: x.x.y.245
  }
  node {
    name: pve15
    nodeid: 3
    quorum_votes: 1
    ring0_addr: x.x.x.246
    ring1_addr: x.x.y.246
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: QSE-PVE
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Code:
Cluster information
-------------------
Name:             QSE-PVE
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug 4 06:47:43 2022
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000004
Ring ID:          2.e4e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 x.x.x.242
0x00000003          1 x.x.x.246
0x00000004          1 x.x.x.243 (local)
0x00000005          1 x.x.x.244
0x00000006          1 x.x.x.245
 
What is the output of cat /etc/pve/.members and journalctl -b -u pve-cluster.service?

Does systemctl reload-or-restart pve-cluster.service on the problematic node help?

Was there a sixth node in the past? How did you remove it?
 
{ "nodename": "pve12", "version": 7, "cluster": { "name": "QSE-PVE", "version": 8, "nodes": 6, "quorate": 1 }, "nodelist": { "pve15": { "id": 3, "online": 1, "ip": "x.x.x.246"}, "pve01": { "id": 1, "online": 0}, "pve11": { "id": 2, "online": 1, "ip": "x.x.x.242"}, "pve12": { "id": 4, "online": 1, "ip": "x.x.x.243"}, "pve13": { "id": 5, "online": 1, "ip": "x.x.x.244"}, "pve14": { "id": 6, "online": 1, "ip": "x.x.x.245"} } }

Restart of pve-cluster didn't help!
Code:
Aug 4 07:44:29 pve12 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 4 07:44:29 pve12 systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: update cluster info (cluster name QSE-PVE, version = 8)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: node has quorum
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: starting data syncronisation
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: received sync request (epoch 2/9970/00000005)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: starting data syncronisation
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: received sync request (epoch 2/9970/00000005)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: received all states
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: leader is 2/9970
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: synced members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: all data is up to date
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: received all states
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: all data is up to date
Aug 4 07:44:30 pve12 systemd[1]: Started The Proxmox VE cluster filesystem.
We removed pve01 using the "how to" from the Proxmox website, as we have done many times in the past. If I remember correctly, we removed pve19 the same way, and it works fine on all the other nodes!
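
For reference, that "how to" boils down to a single command, run on a node that stays in the cluster; as far as I know it intentionally leaves /etc/pve/nodes/<name> in place so guest configs can still be recovered:
Code:
pvecm delnode pve01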

Across pve11, 12, 13, 14 and 15, the only issue is that pve12 still "sees" pve01.

In journalctl, we got nothing except a lot of "received log" messages.
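
To narrow it down, here is a rough sketch for comparing each node's pmxcfs view (assuming root SSH between the nodes; it counts the node entries in /etc/pve/.members per host):
Code:
for h in pve11 pve12 pve13 pve14 pve15; do
  printf '%s: ' "$h"
  ssh "root@$h" "grep -o '\"id\"' /etc/pve/.members | wc -l"
done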
 
@fiona

We just upgraded and everything works well... but in the web interface, we are seeing this:

[screenshot: web interface still showing the removed nodes pve01 and pve19]

Is there a way to completely remove pve01 and pve19 ?
 
Do you have any HA services configured currently?

Please provide the output of the following:
Code:
ha-manager status -v
cat /etc/pve/ha/manager_status
cat /etc/pve/nodes/pve19/lrm_status
cat /etc/pve/nodes/pve01/lrm_status
 
Code:
quorum OK
master pve15 (active, Fri Aug 5 06:48:17 2022)
lrm pve01 (maintenance mode, Sat Feb 26 09:37:13 2022)
lrm pve11 (idle, Fri Aug 5 06:48:17 2022)
lrm pve12 (idle, Fri Aug 5 06:48:19 2022)
lrm pve13 (idle, Fri Aug 5 06:48:20 2022)
lrm pve14 (idle, Fri Aug 5 06:48:17 2022)
lrm pve15 (idle, Fri Aug 5 06:48:17 2022)
lrm pve19 (maintenance mode, Sun May 31 16:21:28 2020)
full cluster state:
{
  "lrm_status" : {
    "pve01" : { "mode" : "maintenance", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1645886233 },
    "pve11" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve12" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696499 },
    "pve13" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696500 },
    "pve14" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve15" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve19" : { "mode" : "maintenance", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1590956488 }
  },
  "manager_status" : {
    "master_node" : "pve15",
    "node_status" : {
      "pve01" : "maintenance",
      "pve11" : "online",
      "pve12" : "online",
      "pve13" : "online",
      "pve14" : "online",
      "pve15" : "online",
      "pve19" : "maintenance"
    },
    "service_status" : {},
    "timestamp" : 1659696497
  },
  "quorum" : { "node" : "pve15", "quorate" : "1" }
}

Code:
cat /etc/pve/ha/manager_status
{"timestamp":1659696527,"master_node":"pve15","service_status":{},"node_status":{"pve11":"online","pve15":"online","pve14":"online","pve12":"online","pve13":"online","pve19":"maintenance","pve01":"maintenance"}}

cat /etc/pve/nodes/pve19/lrm_status
{"mode":"maintenance","timestamp":1590956488,"state":"wait_for_agent_lock","results":{}}

cat /etc/pve/nodes/pve01/lrm_status
{"mode":"maintenance","timestamp":1645886233,"results":{},"state":"wait_for_agent_lock"}
 
I guess the HA manager thinks that the LRM for these two nodes still exists because of these left-over files. After removing them, the manager should switch the LRM status to unknown, and the nodes should disappear after a while (IIRC an hour).

You might even want to remove the whole directories for the gone nodes, after a safety check that nothing in there is still needed.
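
Something like this sketch; running it on a single node is enough, since /etc/pve is the cluster-wide pmxcfs mount (keep a backup outside /etc/pve in case anything is still needed):
Code:
# safety copy first
mkdir -p /root/removed-nodes
cp -r /etc/pve/nodes/pve01 /etc/pve/nodes/pve19 /root/removed-nodes/
# remove the stale LRM status files (clears the HA view) ...
rm /etc/pve/nodes/pve01/lrm_status /etc/pve/nodes/pve19/lrm_status
# ... or the whole left-over node directories
rm -r /etc/pve/nodes/pve01 /etc/pve/nodes/pve19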
 
Yeah, I already checked and there's nothing we still need in pve19 and pve01, so I just need to remove /etc/pve/nodes/{pve01,pve19} on each node?