Hello
I’ve been running a two-node cluster (haumea and makemake, yes I’m into transneptunian objects) for quite some time now, without any problem.
Lately, I noticed that on the web interface one node (makemake) was marked as failed, but the containers on it still worked, so I kept putting off digging into the problem.
Yesterday, I restarted the failing node (makemake), hoping, as so often happens, that this trivial move would solve everything.
Since you’re reading this, you can guess it did not!
I can’t see why the cluster won’t come back up. Of course, I cannot start the containers on the failed node now, since the cluster has no quorum.
I don’t feel like chancing a reboot of the working node (haumea); I really don’t want to lose the containers on that node, I’ve learnt my lesson.
pveversion -v on each node gives the exact same version list.
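If it helps, I can also post the quorum state as seen from each node; this is simply what I would run (standard status commands, nothing exotic):
Code:
# quorum and membership as Proxmox VE sees it
pvecm status
# lower-level view straight from corosync's votequorum
corosync-quorumtool -s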
journalctl -b -u corosync.service on the failed node (makemake):
Code:
11:14:56 corosync[31791]: [MAIN ] Corosync Cluster Engine 3.0.2-dirty starting up
11:14:56 corosync[31791]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
11:14:56 corosync[31791]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
11:14:56 corosync[31791]: [MAIN ] Please migrate config file to nodelist.
11:14:56 corosync[31791]: [TOTEM ] Initializing transport (Kronosnet).
11:14:56 corosync[31791]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
11:14:56 corosync[31791]: [TOTEM ] totemknet initialized
11:14:56 corosync[31791]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync configuration map access [0]
11:14:56 corosync[31791]: [QB ] server name: cmap
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync configuration service [1]
11:14:56 corosync[31791]: [QB ] server name: cfg
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
11:14:56 corosync[31791]: [QB ] server name: cpg
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync profile loading service [4]
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
11:14:56 corosync[31791]: [WD ] Watchdog not enabled by configuration
11:14:56 corosync[31791]: [WD ] resource load_15min missing a recovery key.
11:14:56 corosync[31791]: [WD ] resource memory_used missing a recovery key.
11:14:56 corosync[31791]: [WD ] no resources configured.
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync watchdog service [7]
11:14:56 corosync[31791]: [QUORUM] Using quorum provider corosync_votequorum
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
11:14:56 corosync[31791]: [QB ] server name: votequorum
11:14:56 corosync[31791]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
11:14:56 corosync[31791]: [QB ] server name: quorum
11:14:56 corosync[31791]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]: [KNET ] host: host: 1 has no active links
11:14:56 corosync[31791]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]: [KNET ] host: host: 1 has no active links
11:14:56 corosync[31791]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:56 corosync[31791]: [KNET ] host: host: 1 has no active links
11:14:56 corosync[31791]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
11:14:56 corosync[31791]: [KNET ] host: host: 2 has no active links
11:14:56 corosync[31791]: [TOTEM ] A new membership (2:384) was formed. Members joined: 2
11:14:56 corosync[31791]: [CPG ] downlist left_list: 0 received
11:14:56 corosync[31791]: [QUORUM] Members[1]: 2
11:14:56 corosync[31791]: [MAIN ] Completed service synchronization, ready to provide service.
11:14:56 systemd[1]: Started Corosync Cluster Engine.
11:14:57 corosync[31791]: [KNET ] rx: host: 1 link: 0 is up
11:14:57 corosync[31791]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
11:14:57 corosync[31791]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
11:14:57 corosync[31791]: [KNET ] pmtud: Global data MTU changed to: 1397
journalctl -b -u corosync.service on the still-working node (haumea) (end of the output only; the same lines are repeated constantly):
Code:
11:31:26 corosync[1238]: [TOTEM ] A new membership (1:5068912) was formed. Members
11:31:26 corosync[1238]: [CPG ] downlist left_list: 0 received
11:31:26 corosync[1238]: [QUORUM] Members[1]: 1
11:31:26 corosync[1238]: [MAIN ] Completed service synchronization, ready to provide service.
11:31:28 corosync[1238]: [TOTEM ] A new membership (1:5068916) was formed. Members
11:31:28 corosync[1238]: [CPG ] downlist left_list: 0 received
11:31:28 corosync[1238]: [QUORUM] Members[1]: 1
11:31:28 corosync[1238]: [MAIN ] Completed service synchronization, ready to provide service.
11:31:29 corosync[1238]: [TOTEM ] A new membership (1:5068920) was formed. Members
11:31:29 corosync[1238]: [CPG ] downlist left_list: 0 received
11:31:29 corosync[1238]: [QUORUM] Members[1]: 1
11:31:29 corosync[1238]: [MAIN ] Completed service synchronization, ready to provide service.
11:31:30 corosync[1238]: [TOTEM ] A new membership (1:5068924) was formed. Members
11:31:30 corosync[1238]: [CPG ] downlist left_list: 0 received
11:31:30 corosync[1238]: [QUORUM] Members[1]: 1
11:31:30 corosync[1238]: [MAIN ] Completed service synchronization, ready to provide service.
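The constantly repeated membership lines above make me suspect the knet link is flapping. If the link state is useful, I can run the standard status tool on both nodes and post the result:
Code:
# show the state of each knet link as this node sees it
corosync-cfgtool -s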
/etc/corosync/corosync.conf, identical on both machines
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: haumea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: haumea.coheris.com
  }
  node {
    name: makemake
    nodeid: 2
    quorum_votes: 1
    ring0_addr: makemake.coheris.com
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: devs-spad
  config_version: 2
  interface {
    bindnetaddr: haumea.coheris.com
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
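Side note: the corosync log above warns that the interface/bindnetaddr section is used together with the nodelist and asks me to migrate the config to the nodelist. I don’t know whether that is related to my problem, but I guess a cleaned-up totem section would simply drop that interface block (sketch only; the config_version would have to be bumped on both nodes):
Code:
totem {
  cluster_name: devs-spad
  # interface { bindnetaddr ... } block removed; the nodelist addresses are used anyway
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
}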
omping haumea makemake on each node seems to be working
Code:
root@haumea:~# omping haumea makemake
makemake : waiting for response msg
makemake : waiting for response msg
makemake : joined (S,G) = (*, 232.43.211.234), pinging
makemake : unicast, seq=1, size=69 bytes, dist=0, time=0.269ms
makemake : multicast, seq=1, size=69 bytes, dist=0, time=0.272ms
^C
makemake : unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.269/0.398/0.570/0.128
makemake : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.272/0.410/0.576/0.125
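Since the makemake log above shows a PMTUD change to 1397, I could also verify that full-size frames pass in both directions, assuming a standard 1500-byte MTU on this link (1472 bytes of ICMP payload + 28 bytes of headers = 1500):
Code:
# non-fragmentable pings at full frame size, to be run in both directions
ping -M do -s 1472 -c 5 makemake    # from haumea
ping -M do -s 1472 -c 5 haumea      # from makemake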
on haumea, journalctl -b -u pve* gives:
Code:
janv. 09 11:44:06 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:06 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:07 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pvesr[9594]: trying to acquire cfs lock 'file-replication_cfg' ...
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:08 haumea pveproxy[8625]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
janv. 09 11:44:09 haumea pvesr[9594]: error with cfs lock 'file-replication_cfg': no quorum!
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
janv. 09 11:44:09 haumea systemd[1]: pvesr.service: Failed with result 'exit-code'.
janv. 09 11:44:09 haumea systemd[1]: Failed to start Proxmox VE replication runner.
janv. 09 11:44:10 haumea pveproxy[7272]: Cluster not quorate - extending auth key lifetime!
on makemake, same command:
(See the output in my first reply below; I hit the 10,000-character limit.)
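I know that, as a last resort on a two-node cluster, the surviving node can be made quorate again by lowering the expected votes, but I would rather understand what is actually wrong before touching anything:
Code:
# on haumea only, and only as a temporary workaround (to be reverted once makemake rejoins)
pvecm expected 1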
I don’t know where to look now… Can anyone help?
Thanks!