VMs won't start if a three-node cluster rebooted nearly at the same time - no quorum

Attila

Jun 15, 2016
Hi,

A few days ago all nodes of our three-node cluster restarted almost at once. We are still looking for the cause. Apparently there was a DDoS attack on a server in the same VLAN, but we are still puzzled as to why the nodes rebooted (there are no log entries that would help).

The main problem, however, is something else. Once all nodes were up, NONE of the VMs started.

In the pve logs, I see that startall failed because there was no quorum.

As I see from previous posts, quorum is mandatory for the VMs to start, but if that is so, why is there no retry at 1-2 minute intervals? Quorum was established within a few minutes, but all VMs stayed down.

The HA VMs (there are 4 of them) had the same issue. They failed because sheepdog was not ready. I see no sheepdog errors, BTW.

We are building an HA system, but this issue has ruined our expectations :(

Our questions are:

- How can we make the cluster start the VMs even if the first attempt fails for lack of quorum?
- What caused the sheepdog error? Why wasn't sheepdog ready once the cluster was up, and why did the HA VMs also fail to start?
- Are you aware of any issues, or have you seen any cases, where all three nodes restarted at nearly the same time? (We have no HW watchdog, but I have seen some software watchdog entries in the logs.)


Thanks a lot!

Attila
 
Thanks Dietmar.

I was aware of this, but is there really no better way than to create a mechanism that periodically issues the "systemctl start pve-manager" command after the server starts (for, say, 1 hour)? I don't find this workaround very nice. Couldn't we change the 60 sec wait time to something larger?
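
For reference, a minimal sketch of the retry workaround described above, assuming that `pvecm status` prints a `Quorate: Yes` line once the node has quorum. The function names `retry_until`, `node_is_quorate` and `main`, and the 60 attempts at 60 second intervals, are illustrative only and not part of Proxmox VE. It issues "systemctl start pve-manager" once, as soon as quorum is reported, rather than blindly every few minutes.

```shell
#!/bin/sh
# Sketch only: wait for quorum, then trigger the VM autostart once.

# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds; returns 1 if all attempts fail.
retry_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Assumption: "pvecm status" prints "Quorate: Yes" when the node is quorate.
node_is_quorate() {
    pvecm status 2>/dev/null | grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Check once a minute for up to an hour (illustrative values).
main() {
    if retry_until 60 60 node_is_quorate; then
        systemctl start pve-manager
    else
        echo "no quorum after 60 attempts, giving up" >&2
        return 1
    fi
}

# Run only when called as "<script> start", so the functions can be sourced.
if [ "${1:-}" = "start" ]; then
    main
fi
```

This could be called with the `start` argument from a cron `@reboot` entry or a oneshot unit; how it is hooked into the boot sequence is left open here.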

Any hints on the other two questions?
 
Hi Dietmar,

Interestingly enough, one of our engineers already added this startup dependency to the systemd unit back in March. This is how it looks:

cat sheepdog.service
[Unit]
Description=Sheepdog QEMU/KVM Block Storage
After=network.target corosync.service
Wants=syslog.target
ConditionFileIsExecutable=/usr/sbin/sheep

[Service]
EnvironmentFile=-/etc/default/sheepdog
ExecStart=/usr/lib/sheepdog/sheepdog-start-wrapper
Type=forking
Restart=on-abort
StartLimitInterval=10s
StartLimitBurst=3
LimitNOFILE=infinity

[Install]
WantedBy=multi-user.target


But nevertheless we had that issue at startup... So there must be something else as well.
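
One thing that can be checked here: `After=` only orders unit startup, so sheepdog being started after corosync does not by itself mean quorum has been reached. Below is a small sketch for confirming that the ordering is present in the unit systemd actually loaded, using `systemctl show -p After`. The helper name `unit_ordered_after` is made up for illustration.

```shell
#!/bin/sh
# unit_ordered_after UNIT DEP
# Succeeds if DEP appears in the effective After= list of UNIT.
unit_ordered_after() {
    unit=$1
    dep=$2
    # Output looks like: After=network.target corosync.service ...
    systemctl show -p After "$unit" | tr ' =' '\n\n' | grep -qxF -- "$dep"
}

# Run the check only when called as "<script> check".
if [ "${1:-}" = "check" ]; then
    if unit_ordered_after sheepdog.service corosync.service; then
        echo "sheepdog.service is ordered after corosync.service"
    else
        echo "ordering missing in the loaded unit" >&2
        exit 1
    fi
fi
```

If the check fails, the edited unit file may not be the one systemd loaded (for example, a copy under a different unit directory takes precedence, or `systemctl daemon-reload` was not run).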

When do you plan to release pve-sheepdog 1.0?
 
