[SOLVED] PVE 5.0.21: cluster stopped working after reboot of a failed node, while all LXCs on that node were running before the reboot…

Sxilderik

Hello

I’ve been running a two-node cluster (haumea and makemake, yes, I’m into trans-Neptunian objects) for quite some time now, without any problems.
Lately, I noticed on the web interface that one node (makemake) was marked as failed, but the containers on it still worked, so I kept postponing digging into the problem.

Yesterday, I restarted the failing node (makemake), hoping, as so often happens, that this trivial move would solve everything.

As you’re reading this, you can guess it did not!

I can’t see why the cluster is not up. Of course, I cannot start the containers on the failed node now, since the cluster is not quorate.
I don’t feel like chancing a reboot of the working node (haumea); I really don’t want to lose the containers on that node, I’ve learnt my lesson.

pveversion -v on each node gives the exact same version list.
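(For the record, I compared them roughly like this; the hostnames are obviously specific to my setup:)
Code:
root@haumea:~# pveversion -v > /tmp/versions-haumea
root@haumea:~# ssh root@makemake pveversion -v > /tmp/versions-makemake
root@haumea:~# diff /tmp/versions-haumea /tmp/versions-makemake
# no output, so both nodes run identical package versions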

journalctl -b -u corosync.service on failed node (makemake):
Code:
11:14:56 corosync[31791]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
11:14:56 corosync[31791]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
11:14:56 corosync[31791]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
11:14:56 corosync[31791]:   [MAIN  ] Please migrate config file to nodelist.
11:14:56 corosync[31791]:   [TOTEM ] Initializing transport (Kronosnet).
11:14:56 corosync[31791]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
11:14:56 corosync[31791]:   [TOTEM ] totemknet initialized
11:14:56 corosync[31791]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
11:14:56 corosync[31791]:   [QB    ] server name: cmap
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration service [1]
11:14:56 corosync[31791]:   [QB    ] server name: cfg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
11:14:56 corosync[31791]:   [QB    ] server name: cpg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
11:14:56 corosync[31791]:   [WD    ] Watchdog not enabled by configuration
11:14:56 corosync[31791]:   [WD    ] resource load_15min missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] resource memory_used missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] no resources configured.
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
11:14:56 corosync[31791]:   [QUORUM] Using quorum provider corosync_votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
11:14:56 corosync[31791]:   [QB    ] server name: votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
11:14:56 corosync[31791]:   [QB    ] server name: quorum
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 has no active links
11:14:56 corosync[31791]:   [TOTEM ] A new membership (2:384) was formed. Members joined: 2
11:14:56 corosync[31791]:   [CPG   ] downlist left_list: 0 received
11:14:56 corosync[31791]:   [QUORUM] Members[1]: 2
11:14:56 corosync[31791]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:14:56 systemd[1]: Started Corosync Cluster Engine.
11:14:57 corosync[31791]:   [KNET  ] rx: host: 1 link: 0 is up
11:14:57 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:57 corosync[31791]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
11:14:57 corosync[31791]:   [KNET  ] pmtud: Global data MTU changed to: 1397

journalctl -b -u corosync.service on the still-working node (haumea) (end of the output only; the same lines keep repeating):
Code:
11:31:26 corosync[1238]:   [TOTEM ] A new membership (1:5068912) was formed. Members
11:31:26 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:26 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:26 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:28 corosync[1238]:   [TOTEM ] A new membership (1:5068916) was formed. Members
11:31:28 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:28 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:28 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:29 corosync[1238]:   [TOTEM ] A new membership (1:5068920) was formed. Members
11:31:29 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:29 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:29 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:30 corosync[1238]:   [TOTEM ] A new membership (1:5068924) was formed. Members
11:31:30 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:30 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:30 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.

/etc/corosync/corosync.conf, identical on both machines
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: haumea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: haumea.coheris.com
  }
  node {
    name: makemake
    nodeid: 2
    quorum_votes: 1
    ring0_addr: makemake.coheris.com
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: devs-spad
  config_version: 2
  interface {
    bindnetaddr: haumea.coheris.com
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
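Side note: the makemake corosync log above complains that "interface section bindnetaddr is used together with nodelist" and asks to migrate the config to the nodelist. If I read that warning correctly, the totem section could simply drop the interface block, something like the sketch below; I have not applied this, and bumping config_version is just my understanding of the usual procedure:
Code:
totem {
  cluster_name: devs-spad
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
}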

omping haumea makemake seems to work on each node:
Code:
root@haumea:~# omping haumea makemake
makemake : waiting for response msg
makemake : waiting for response msg
makemake : joined (S,G) = (*, 232.43.211.234), pinging
makemake :   unicast, seq=1, size=69 bytes, dist=0, time=0.269ms
makemake : multicast, seq=1, size=69 bytes, dist=0, time=0.272ms
^C
makemake :   unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.269/0.398/0.570/0.128
makemake : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.272/0.410/0.576/0.125
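Since corosync 3 with kronosnet talks over unicast UDP rather than multicast, I also made sure corosync is actually listening on its UDP port on both nodes (5405 should be the default, assuming I never overrode it anywhere):
Code:
root@haumea:~# ss -ulpn | grep corosync
root@makemake:~# ss -ulpn | grep corosync
# corosync should show up bound to UDP 5405 on each node's ring0 address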

On haumea, journalctl -b -u pve* gives:
Code:
janv. 09 11:44:06 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:06 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pvesr[9594]: error with cfs lock 'file-replication_cfg': no quorum!
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Failed with result 'exit-code'.
janv. 09 11:44:09 haumea systemd[1]: Failed to start Proxmox VE replication runner.
janv. 09 11:44:10 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
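For completeness, this is how I keep checking the quorum state on haumea (I am not pasting the output, it just confirms that haumea only sees its own vote):
Code:
root@haumea:~# pvecm status
root@haumea:~# corosync-quorumtool -s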

On makemake, the same command's output is in my first reply below (I hit the 10,000-character post limit).

I don’t know where to look now… Can anyone help?

Thanks !
 
Sorry, I broke the 10,000-character limit; here is the rest.

On makemake (the failed node), the same command (journalctl -b -u pve*):
Code:
10:48:55 systemd[1]: Starting Proxmox VE Login Banner...
10:48:55 systemd[1]: Starting Commit Proxmox VE network changes...
10:48:55 systemd[1]: Starting Proxmox VE firewall logger...
10:48:55 systemd[1]: Started Commit Proxmox VE network changes.
10:48:56 pvefw-logger[696]: starting pvefw logger
10:48:56 systemd[1]: Started Proxmox VE firewall logger.
10:48:56 systemd[1]: Started Proxmox VE replication runner.
10:48:56 systemd[1]: Started Daily PVE download activities.
10:48:59 systemd[1]: Reached target PVE Storage Target.
10:49:00 systemd[1]: Starting Proxmox VE replication runner...
10:49:02 systemd[1]: Starting The Proxmox VE cluster filesystem...
10:49:03 pmxcfs[1101]: [quorum] crit: quorum_initialize failed: 2
10:49:03 pmxcfs[1101]: [quorum] crit: can't initialize service
10:49:03 pmxcfs[1101]: [confdb] crit: cmap_initialize failed: 2
10:49:03 pmxcfs[1101]: [confdb] crit: can't initialize service
10:49:03 pmxcfs[1101]: [dcdb] crit: cpg_initialize failed: 2
10:49:03 pmxcfs[1101]: [dcdb] crit: can't initialize service
10:49:03 pmxcfs[1101]: [status] crit: cpg_initialize failed: 2
10:49:03 pmxcfs[1101]: [status] crit: can't initialize service
10:49:03 systemd[1]: Started Proxmox VE Login Banner.
10:49:04 systemd[1]: Started The Proxmox VE cluster filesystem.
10:49:04 systemd[1]: Starting PVE Status Daemon...
10:49:04 systemd[1]: Starting Proxmox VE firewall...
10:49:06 systemd[1]: Starting PVE API Daemon...
10:49:07 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:07 pve-firewall[1158]: starting server
10:49:07 systemd[1]: Started Proxmox VE firewall.
10:49:07 pvestatd[1166]: starting server
10:49:07 systemd[1]: Started PVE Status Daemon.
10:49:08 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:09 pmxcfs[1101]: [status] notice: update cluster info (cluster name  devs-spad, version = 2)
10:49:09 pmxcfs[1101]: [dcdb] notice: members: 2/1101
10:49:09 pmxcfs[1101]: [dcdb] notice: all data is up to date
10:49:09 pmxcfs[1101]: [status] notice: members: 2/1101
10:49:09 pmxcfs[1101]: [status] notice: all data is up to date
10:49:09 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:10 pvedaemon[1179]: starting server
10:49:10 pvedaemon[1179]: starting 3 worker(s)
10:49:10 pvedaemon[1179]: worker 1180 started
10:49:10 pvedaemon[1179]: worker 1181 started
10:49:10 pvedaemon[1179]: worker 1182 started
10:49:10 systemd[1]: Started PVE API Daemon.
10:49:10 systemd[1]: Starting PVE API Proxy Server...
10:49:10 systemd[1]: Starting PVE Cluster Resource Manager Daemon...
10:49:10 pvesr[966]: trying to acquire cfs lock 'file-replication_cfg' ...
10:49:10 pve-ha-crm[1329]: starting server
10:49:10 pve-ha-crm[1329]: status change startup => wait_for_quorum
10:49:10 systemd[1]: Started PVE Cluster Resource Manager Daemon.
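The pmxcfs "crit: … initialize failed: 2" lines at 10:49:03 look like pmxcfs simply starting before corosync was up (it reports "all data is up to date" a few seconds later). If that ever turns out to be the real issue, I assume restarting the cluster filesystem once corosync is running would be enough:
Code:
root@makemake:~# systemctl restart pve-cluster    # restart pmxcfs so it re-attaches to corosync
root@makemake:~# journalctl -u pve-cluster -n 20  # the crit: initialize failed lines should be gone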
 
[Screenshot: result of connecting to makemake:8006 and haumea:8006]

I guess this is what is called a “split brain”…

How can I recover from that? What are my options?
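The only workaround I can think of on my own is to temporarily tell the node I trust (haumea) to expect a single vote, so its pmxcfs becomes writable again, but I am not sure that is safe while the other node is up, so I have not run it yet:
Code:
root@haumea:~# pvecm expected 1    # temporarily lower the expected votes to regain quorum
root@haumea:~# pvecm status        # check that haumea is now quorate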

Thanks for any help…
 
Hmm, output of corosync-cfgtool -s on both nodes:

nodeid 1 is haumea
nodeid 2 is makemake

Code:
root@haumea:~# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
        addr    = 172.16.1.125
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:0

Code:
root@makemake:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
        addr    = 172.16.1.228
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:1

On haumea, the link to makemake is enabled but not connected.
On makemake, everything looks OK.
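Unless someone sees something better, my next move will be to restart corosync on haumea and see whether the knet link to makemake finally comes up:
Code:
root@haumea:~# systemctl restart corosync
root@haumea:~# corosync-cfgtool -s               # re-check the link status
root@haumea:~# journalctl -b -u corosync | tail  # look for a "link: 0 is up" line for host 2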

Any clue?
 
