VMs won't start if a three-node cluster rebooted nearly at the same time - no quorum

Attila · Sep 8, 2016

Hi,

A few days ago all nodes of our three-node cluster restarted almost at once. We are still looking for the cause. Apparently there was a DDoS attack on a server in the same VLAN, but we are still puzzled why the nodes rebooted (there are no log entries that would help).

The main problem, however, is something else. Once all nodes were up, NONE of the VMs started.

In the pve logs, I see that startall failed because there was no quorum.

As I understand from previous posts, quorum is mandatory for the VMs to start, but if so, why is there no retry at 1-2 minute intervals? Quorum was established within a few minutes, but all VMs stayed down.

Same issue with the HA VMs (there are 4 of them). These failed because sheepdog was not ready. I see no sheepdog errors, BTW.

We are building an HA system, but this issue has ruined our expectations.

Our questions are:

- How can we make the cluster start the VMs even if the first attempt fails for lack of quorum?
- What caused the sheepdog error, why wasn't sheepdog ready once the cluster was up, and why did the HA VMs also fail to start?
- Are you aware of any issues or cases where all three nodes restarted at nearly the same time? (We have no HW watchdog, but I have seen some software watchdog entries in the logs.)

Thanks a lot!

Attila

dietmar · Sep 9, 2016

Attila said:
In the pve logs, I see that startall failed because there was no quorum.

Please read http://pve.proxmox.com/wiki/Cluster_Manager (section 'Cluster Cold Start')

Attila · Sep 9, 2016

Thanks Dietmar.

I was aware of this, but is there really no better way than to create a mechanism that periodically issues the "systemctl start pve-manager" command after the server starts (for e.g. 1 hour)? I don't find this workaround very nice. Couldn't we change the 60 sec wait time to something larger?
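
For reference, the periodic retry described above could be sketched roughly like this. It is only a sketch under assumptions, not an official mechanism: the function names, INTERVAL and MAX_TRIES are made up for illustration, and it assumes that "pvecm status" prints a "Quorate: Yes" line once the cluster has quorum.

```shell
#!/bin/sh
# Hypothetical helper: poll for quorum, then start the guests once.
# All names below are illustrative, not part of any Proxmox package.

INTERVAL="${INTERVAL:-60}"    # seconds to sleep between quorum checks
MAX_TRIES="${MAX_TRIES:-60}"  # 60 checks x 60 s = give up after about 1 hour

# Assumption: "pvecm status" prints a line like "Quorate:   Yes" when quorate.
has_quorum() {
    pvecm status 2>/dev/null | grep -Eq '^Quorate:[[:space:]]+Yes'
}

# The command from the post above; it triggers the startall run again.
start_guests() {
    systemctl start pve-manager
}

# Check up to MAX_TRIES times; start the guests on the first successful check.
wait_and_start() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if has_quorum; then
            start_guests
            return $?
        fi
        tries=$((tries + 1))
        sleep "$INTERVAL"
    done
    echo "no quorum after $MAX_TRIES checks, giving up" >&2
    return 1
}
```

To use it, one would call wait_and_start at the end of the script and run the script once at boot (for example from a oneshot unit). The guests are started only once, on the first check that reports quorum, so a later loss of quorum does not trigger a second startall.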

Any hints on the other two questions?

dietmar · Sep 9, 2016

HA manager can start the VMs for you ...

dietmar · Sep 9, 2016

OK, we can simply wait forever - I have committed a patch here:

https://git.proxmox.com/?p=pve-manager.git;a=commitdiff;h=86df246d4145c59e2cdd3bc8ca4afbc2d4ca288e

Attila · Sep 9, 2016

Thanks for the patch.

Regarding HA: as I wrote, we have some HA servers as well, and they did not start either.

What about the other two questions?

dietmar · Sep 9, 2016

Attila said:
These have failed because sheepdog was not ready. I see no sheepdog errors BTW.

AFAIR there was a missing startup dependency:

https://git.proxmox.com/?p=pve-sheepdog.git;a=commitdiff;h=d1b01c8db2aae68bb3a727c6039d2844c7a8800a

will be fixed with next package upload (pve-sheepdog 1.0.0-1)

Attila · Sep 19, 2016

Hi Dietmar,

Interestingly enough, one of our engineers already added this startup dependency to the systemd unit back in March. This is how it looks:

cat sheepdog.service
[Unit]
Description=Sheepdog QEMU/KVM Block Storage
After=network.target corosync.service
Wants=syslog.target
ConditionFileIsExecutable=/usr/sbin/sheep

[Service]
EnvironmentFile=-/etc/default/sheepdog
ExecStart=/usr/lib/sheepdog/sheepdog-start-wrapper
Type=forking
Restart=on-abort
StartLimitInterval=10s
StartLimitBurst=3
LimitNOFILE=infinity

[Install]
WantedBy=multi-user.target

But nevertheless we had that issue at startup... So there must be something else as well.
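
A quick way to double-check that the ordering is really present in a unit file is sketched below. The helper name unit_ordered_after is made up for illustration. Note that After= only controls start order: it says nothing about whether the cluster is already quorate when sheepdog starts, which may be why the dependency alone did not help.

```shell
# Hypothetical helper: succeed if unit file $1 names unit $2 on an After= line.
unit_ordered_after() {
    # Strip the "After=" prefix, put one unit per line, look for an exact match.
    sed -n 's/^After=//p' "$1" | tr ' ' '\n' | grep -Fxq -- "$2"
}
```

For example, `unit_ordered_after /lib/systemd/system/sheepdog.service corosync.service && echo ordered` (the path is an assumption, "systemctl cat sheepdog.service" shows where the unit actually lives).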

When do you plan to release pve-sheepdog 1.0?
