Hello
I’ve been running a two-node cluster (haumea and makemake; yes, I’m into trans-Neptunian objects) for quite some time now without any problems.
Lately, I noticed that the web interface marked one node (makemake) as failed, but the containers on it still worked, so I kept postponing digging into the problem.
Yesterday, I restarted the failing node (makemake), hoping, as so often happens, that this trivial move would solve everything.
As you’re reading this, you can guess it did not!
I can’t see why the cluster is not up. Of course, I cannot start the containers on the failed node now, since the cluster is not ready for lack of quorum.
I don’t feel like chancing a reboot of the working node (haumea): I really don’t want to lose the containers on that node. I’ve learnt my lesson.
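To spell out why a single node is stuck, here is my understanding of the simple-majority rule that votequorum applies by default, as a small Python sketch. The function names are made up for illustration and are not part of corosync or Proxmox, and special votequorum options such as two_node are deliberately ignored.

```python
def votes_needed(expected_votes: int) -> int:
    """Simple majority: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes currently visible reach the majority threshold."""
    return votes_present >= votes_needed(expected_votes)


if __name__ == "__main__":
    # Assumption: two nodes with quorum_votes: 1 each, as in the
    # corosync.conf shown further down in this post.
    expected = 2
    print("votes needed:", votes_needed(expected))
    print("one node alone quorate?", is_quorate(1, expected))
    print("both nodes quorate?", is_quorate(2, expected))
```

With two votes expected, the majority is two, so each node on its own stays non-quorate until it sees the other one.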
pveversion -v on each node gives the exact same version list.
journalctl -b -u corosync.service on the failed node (makemake):
Code:
11:14:56 corosync[31791]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
11:14:56 corosync[31791]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
11:14:56 corosync[31791]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
11:14:56 corosync[31791]:   [MAIN  ] Please migrate config file to nodelist.
11:14:56 corosync[31791]:   [TOTEM ] Initializing transport (Kronosnet).
11:14:56 corosync[31791]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
11:14:56 corosync[31791]:   [TOTEM ] totemknet initialized
11:14:56 corosync[31791]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
11:14:56 corosync[31791]:   [QB    ] server name: cmap
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync configuration service [1]
11:14:56 corosync[31791]:   [QB    ] server name: cfg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
11:14:56 corosync[31791]:   [QB    ] server name: cpg
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
11:14:56 corosync[31791]:   [WD    ] Watchdog not enabled by configuration
11:14:56 corosync[31791]:   [WD    ] resource load_15min missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] resource memory_used missing a recovery key.
11:14:56 corosync[31791]:   [WD    ] no resources configured.
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
11:14:56 corosync[31791]:   [QUORUM] Using quorum provider corosync_votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
11:14:56 corosync[31791]:   [QB    ] server name: votequorum
11:14:56 corosync[31791]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
11:14:56 corosync[31791]:   [QB    ] server name: quorum
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]:   [KNET  ] host: host: 1 has no active links
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]:   [KNET  ] host: host: 2 has no active links
11:14:56 corosync[31791]:   [TOTEM ] A new membership (2:384) was formed. Members joined: 2
11:14:56 corosync[31791]:   [CPG   ] downlist left_list: 0 received
11:14:56 corosync[31791]:   [QUORUM] Members[1]: 2
11:14:56 corosync[31791]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:14:56 systemd[1]: Started Corosync Cluster Engine.
11:14:57 corosync[31791]:   [KNET  ] rx: host: 1 link: 0 is up
11:14:57 corosync[31791]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:57 corosync[31791]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
11:14:57 corosync[31791]:   [KNET  ] pmtud: Global data MTU changed to: 1397
journalctl -b -u corosync.service on the still-working node (haumea) (end of output only; the same lines are constantly repeated):
Code:
11:31:26 corosync[1238]:   [TOTEM ] A new membership (1:5068912) was formed. Members
11:31:26 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:26 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:26 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:28 corosync[1238]:   [TOTEM ] A new membership (1:5068916) was formed. Members
11:31:28 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:28 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:28 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:29 corosync[1238]:   [TOTEM ] A new membership (1:5068920) was formed. Members
11:31:29 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:29 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:29 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
11:31:30 corosync[1238]:   [TOTEM ] A new membership (1:5068924) was formed. Members
11:31:30 corosync[1238]:   [CPG   ] downlist left_list: 0 received
11:31:30 corosync[1238]:   [QUORUM] Members[1]: 1
11:31:30 corosync[1238]:   [MAIN  ] Completed service synchronization, ready to provide service.
/etc/corosync/corosync.conf, identical on both machines:
Code:
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: haumea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: haumea.coheris.com
  }
  node {
    name: makemake
    nodeid: 2
    quorum_votes: 1
    ring0_addr: makemake.coheris.com
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: devs-spad
  config_version: 2
  interface {
    bindnetaddr: haumea.coheris.com
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
omping haumea makemake on each node seems to be working:
Code:
root@haumea:~# omping haumea makemake
makemake : waiting for response msg
makemake : waiting for response msg
makemake : joined (S,G) = (*, 232.43.211.234), pinging
makemake :   unicast, seq=1, size=69 bytes, dist=0, time=0.269ms
makemake : multicast, seq=1, size=69 bytes, dist=0, time=0.272ms
^C
makemake :   unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.269/0.398/0.570/0.128
makemake : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.272/0.410/0.576/0.125
On haumea, journalctl -b -u pve* gives:
Code:
janv. 09 11:44:06 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:06 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pvesr[9594]: error with cfs lock 'file-replication_cfg': no quorum!
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Failed with result 'exit-code'.
janv. 09 11:44:09 haumea systemd[1]: Failed to start Proxmox VE replication runner.
janv. 09 11:44:10 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
On makemake, same command:
Code:
(see the output in the first reply; I exceeded the 10000-character limit)
I don’t know where to look now… Can anyone help?
Thanks!