[SOLVED] PVE 5.0.21 : cluster stopped working after reboot of a failed node while all LCXs on that node were running before reboot…

Sxilderik · Jan 9, 2020

Hello

I’ve been running a two-node cluster (haumea and makemake, yes I’m into transneptunian objects) for quite some time now, without any problem.
Lately, I noticed that on the web interface, one node (makemake) was marked failed, but the containers inside it still worked, so I kept postponing the digging into that problem.

Yesterday, I restarted the failing node (makemake), hoping, as it happens so often, that that trivial move would solve everything.

As you’re reading this, you can guess it did not!

I can’t see why the cluster is not up. Of course, I can not start the containers on the failed node now, since the cluster is not ready by lack of quorum.
I don’t feel like chancing a reboot of the working node (haumea), I really dont wan’t to lose the containers on that node, I’ve learnt my lesson.

pveversion -v on each node gives the exact same version list.

journalctl -b -u corosync.service on failed node (makemake):

Code:

11:14:56 corosync[31791]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
11:14:56 corosync[31791]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
11:14:56 corosync[31791]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
11:14:56 corosync[31791]:   [MAIN  ] Please migrate config file to nodelist.
11:14:56 corosync[31791]:   [TOTEM ] Initializing transport (Kronosnet).
11:14:56 corosync[31791]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
11:14:56 corosync[31791]:   [TOTEM ] totemknet initialized
11:14:56 corosync[31791]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
11:14:56 corosync[31791]:   [QB    ] server name: cmap
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration service [1]
11:14:56 corosync[31791]:   [QB    ] server name: cfg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
11:14:56 corosync[31791]:   [QB    ] server name: cpg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
11:14:56 corosync[31791]:   [WD    ] Watchdog not enabled by configuration
11:14:56 corosync[31791]:   [WD    ] resource load_15min missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] resource memory_used missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] no resources configured.
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
11:14:56 corosync[31791]:   [QUORUM] Using quorum provider corosync_votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
11:14:56 corosync[31791]:   [QB    ] server name: votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
11:14:56 corosync[31791]:   [QB    ] server name: quorum
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 has no active links
11:14:56 corosync[31791]:   [TOTEM ] A new membership (2:384) was formed. Members joined: 2
11:14:56 corosync[31791]:   [CPG   ] downlist left_list: 0 received
11:14:56 corosync[31791]:   [QUORUM] Members[1]: 2
11:14:56 corosync[31791]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:14:56 systemd[1]: Started Corosync Cluster Engine.
11:14:57 corosync[31791]:   [KNET  ] rx: host: 1 link: 0 is up
11:14:57 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:57 corosync[31791]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
11:14:57 corosync[31791]:   [KNET  ] pmtud: Global data MTU changed to: 1397

journalctl -b -u corosync.service on still working node (haumea) (end of display only, the same lines are constantly repeated)

Code:

11:31:26 corosync[1238]:   [TOTEM ] A new membership (1:5068912) was formed. Members
11:31:26 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:26 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:26 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:28 corosync[1238]:   [TOTEM ] A new membership (1:5068916) was formed. Members
11:31:28 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:28 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:28 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:29 corosync[1238]:   [TOTEM ] A new membership (1:5068920) was formed. Members
11:31:29 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:29 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:29 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:30 corosync[1238]:   [TOTEM ] A new membership (1:5068924) was formed. Members
11:31:30 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:30 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:30 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.

/etc/corosync/corosync.conf, identical on both machines

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: haumea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: haumea.coheris.com
  }
  node {
    name: makemake
    nodeid: 2
    quorum_votes: 1
    ring0_addr: makemake.coheris.com
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: devs-spad
  config_version: 2
  interface {
    bindnetaddr: haumea.coheris.com
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

omping haumea makemake on each node seems to be working

Code:

root@haumea:~# omping haumea makemake
makemake : waiting for response msg
makemake : waiting for response msg
makemake : joined (S,G) = (*, 232.43.211.234), pinging
makemake :   unicast, seq=1, size=69 bytes, dist=0, time=0.269ms
makemake : multicast, seq=1, size=69 bytes, dist=0, time=0.272ms
^C
makemake :   unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.269/0.398/0.570/0.128
makemake : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.272/0.410/0.576/0.125

on haumea, journalctl -b -u pve* gives:

Code:

janv. 09 11:44:06 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:06 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pvesr[9594]: error with cfs lock 'file-replication_cfg': no quorum!
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Failed with result 'exit-code'.
janv. 09 11:44:09 haumea systemd[1]: Failed to start Proxmox VE replication runner.
janv. 09 11:44:10 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!

on makemake, same command:

Code:

see output in first reply, I broke the 10000 chars limit

I don’t know where to look now… Can anyone help ?

Thanks !

Sxilderik · Jan 9, 2020

Sorry, broke the 10000 chars limit

on makemake (failed node), same command (journalctl -b -u pve*):

Code:

10:48:55 systemd[1]: Starting Proxmox VE Login Banner...
10:48:55 systemd[1]: Starting Commit Proxmox VE network changes...
10:48:55 systemd[1]: Starting Proxmox VE firewall logger...
10:48:55 systemd[1]: Started Commit Proxmox VE network changes.
10:48:56 pvefw-logger[696]: starting pvefw logger
10:48:56 systemd[1]: Started Proxmox VE firewall logger.
10:48:56 systemd[1]: Started Proxmox VE replication runner.
10:48:56 systemd[1]: Started Daily PVE download activities.
10:48:59 systemd[1]: Reached target PVE Storage Target.
10:49:00 systemd[1]: Starting Proxmox VE replication runner...
10:49:02 systemd[1]: Starting The Proxmox VE cluster filesystem...
10:49:03 pmxcfs[1101]: [quorum] crit: quorum_initialize failed: 2
10:49:03 pmxcfs[1101]: [quorum] crit: can't initialize service
10:49:03 pmxcfs[1101]: [confdb] crit: cmap_initialize failed: 2
10:49:03 pmxcfs[1101]: [confdb] crit: can't initialize service
10:49:03 pmxcfs[1101]: [dcdb] crit: cpg_initialize failed: 2
10:49:03 pmxcfs[1101]: [dcdb] crit: can't initialize service
10:49:03 pmxcfs[1101]: [status] crit: cpg_initialize failed: 2
10:49:03 pmxcfs[1101]: [status] crit: can't initialize service
10:49:03 systemd[1]: Started Proxmox VE Login Banner.
10:49:04 systemd[1]: Started The Proxmox VE cluster filesystem.
10:49:04 systemd[1]: Starting PVE Status Daemon...
10:49:04 systemd[1]: Starting Proxmox VE firewall...
10:49:06 systemd[1]: Starting PVE API Daemon...
10:49:07 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:07 pve-firewall[1158]: starting server
10:49:07 systemd[1]: Started Proxmox VE firewall.
10:49:07 pvestatd[1166]: starting server
10:49:07 systemd[1]: Started PVE Status Daemon.
10:49:08 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:09 pmxcfs[1101]: [status] notice: update cluster info (cluster name  devs-spad, version = 2)
10:49:09 pmxcfs[1101]: [dcdb] notice: members: 2/1101
10:49:09 pmxcfs[1101]: [dcdb] notice: all data is up to date
10:49:09 pmxcfs[1101]: [status] notice: members: 2/1101
10:49:09 pmxcfs[1101]: [status] notice: all data is up to date
10:49:09 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:10 pvedaemon[1179]: starting server
10:49:10 pvedaemon[1179]: starting 3 worker(s)
10:49:10 pvedaemon[1179]: worker 1180 started
10:49:10 pvedaemon[1179]: worker 1181 started
10:49:10 pvedaemon[1179]: worker 1182 started
10:49:10 systemd[1]: Started PVE API Daemon.
10:49:10 systemd[1]: Starting PVE API Proxy Server...
10:49:10 systemd[1]: Starting PVE Cluster Resource Manager Daemon...
10:49:10 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:10 pve-ha-crm[1329]: starting server
10:49:10 pve-ha-crm[1329]: status change startup => wait_for_quorum
10:49:10 systemd[1]: Started PVE Cluster Resource Manager Daemon.

bill209 · Jan 11, 2020

@Sxilderik , i have a similar issue with VE-6, and just posted here: https://forum.proxmox.com/threads/emergency-mode-possible-cluster-issue.63175/

i just saw your posting and will keep a watch on it to see if it gets resolved. good luck! : )

Sxilderik · Jan 13, 2020

Result on connecting on makemake:8006 and haumea:8006

I guess this is what is called a “split brain”…

How can I recover from that? What are my options?

Thanks for any help…

Sxilderik · Jan 13, 2020

Hmm, output of corosync-cfgtool -s on both nodes

nodeid 1 is haumea
nodeid 2 is makemake

Code:

root@haumea:~# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
        addr    = 172.16.1.125
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:0

Code:

root@makemake:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
        addr    = 172.16.1.228
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:1

On haumea, link to makemake is enabled but not connected
On makemake, all seems ok

Any clue?

Sxilderik · Jan 13, 2020

Found solution in this thread : https://github.com/corosync/corosync/issues/506

I restarted corosync.service on haumea, and it worked !

Search

Search

[SOLVED] PVE 5.0.21 : cluster stopped working after reboot of a failed node while all LCXs on that node were running before reboot…

Sxilderik

Member

Sxilderik

Member

bill209

Member

Sxilderik

Member

Sxilderik

Member

Sxilderik

Member

We value your privacy