Proxmox restarted unexpectedly

manzke89 · Feb 19, 2024

The remaining question is: if for some reason I lose communication between the 2 nodes, will they reboot?

esi_y · Feb 19, 2024

manzke89 said:
link for log files: https://file.io/WlfZ4AHZctrv

The file is blank: cat /etc/pve/ha/resources.cfg

The CRM on your node 23 became active on Jan 31 because vm:2501 was set as an HA service, which is when you were testing HA, I suppose. From the log, it appears the node has been subject to the watchdog ever since, and that the VM was an HA service all along. It was started on node 25, whose LRM was subject to the same.

If your ha-manager status shows any active CRM or LRM, those nodes are definitely going to reboot if you lose quorum again for a while. And from what I have seen, I do not think they are all idle, even though your HA config is empty, which is interesting.

When was it that you think you removed the HA service originally?

manzke89 · Feb 19, 2024

tempacc346235 said:
When was it that you think you removed the HA service originally?

today in the morning

esi_y · Feb 20, 2024

manzke89 said:
today in the morning

The logs you have provided before went till Feb 19 10:33:38 mdc-023, so ... was it before or after that you removed it?

Also, what does ha-manager status show now (does not matter on which node)?

manzke89 · Feb 20, 2024

tempacc346235 said:
The logs you have provided before went till Feb 19 10:33:38 mdc-023, so ... was it before or after that you removed it?

before

tempacc346235 said:
Also, what does ha-manager status show now (does not matter on which node)?

root@mdc-023:~# ha-manager status
quorum OK
master mdc-023 (active, Tue Feb 20 11:27:31 2024)
lrm mdc-022 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-023 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-025 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc024 (idle, Tue Feb 20 11:27:36 2024)

esi_y · Feb 20, 2024

manzke89 said:
before

Alright!

So ...

Code:

Feb 19 09:46:45 mdc-023 pve-ha-crm[11895]: removing stale service 'vm:2501' (no config)

If I get it right, you had this one forgotten HA machine there. Your reboots happened prior to this and they happened because of the quorum hiccups. Afterwards you removed the last HA service (yesterday morning).
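To pin down when an HA service was actually dropped, the log line above can be searched for mechanically. Below is a minimal Python sketch, assuming the syslog-style format of the single line quoted in this thread; the function name stale_service_removals and the regular expression are illustrative, not part of any Proxmox tooling.

```python
import re

# Pattern modelled on the one log line quoted in the thread (an assumption,
# other log formats such as ISO timestamps would need a different pattern).
STALE_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"pve-ha-crm\[\d+\]:\s+removing stale service '(?P<sid>[^']+)'"
)


def stale_service_removals(log_lines):
    """Return (timestamp, host, service id) for each stale-service removal."""
    hits = []
    for line in log_lines:
        match = STALE_RE.match(line.strip())
        if match:
            hits.append((match.group("when"), match.group("host"), match.group("sid")))
    return hits


# The line below is copied verbatim from the thread.
SAMPLE_LOG = [
    "Feb 19 09:46:45 mdc-023 pve-ha-crm[11895]: removing stale service 'vm:2501' (no config)",
]

if __name__ == "__main__":
    for when, host, sid in stale_service_removals(SAMPLE_LOG):
        print(f"{when}: {host} dropped {sid}")
```

Any reboot logged before the reported timestamp happened while the service was still under HA.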

This now makes more sense:

manzke89 said:

Code:

root@mdc-023:~# ha-manager status
quorum OK
master mdc-023 (active, Tue Feb 20 11:27:31 2024)
lrm mdc-022 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-023 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-025 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc024 (idle, Tue Feb 20 11:27:36 2024)

So beyond the known bug of the "dangling" CRM [1] (on a setup that previously used HA), the nodes now should not reboot anymore even if quorum is wonky.

I say "should" because it's a bit more complicated; the explanation is a bit lengthy and is covered in another post of mine [2].

Long story short, the safest option for you now is probably to reboot mdc-023, double-check with ha-manager status that the master and all LRMs are ALL idle, and not set any service up as HA again. After that, upon lost quorum, you definitely "should" not be getting reboots.

I say "should" again because the more drastic solution is at the end of the referred post above [2] for those that want to be absolutely sure.
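The "ALL idle" check can also be scripted rather than eyeballed. This is a rough Python sketch, assuming ha-manager status keeps the line layout shown earlier in this thread; the helper name non_idle_entries and the regular expression are my own illustration, and the sample text is the output posted above rather than a live capture.

```python
import re

# Output copied from the thread; on a real node you would capture the
# output of `ha-manager status` yourself and pass it in instead.
SAMPLE_STATUS = """\
quorum OK
master mdc-023 (active, Tue Feb 20 11:27:31 2024)
lrm mdc-022 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-023 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc-025 (idle, Tue Feb 20 11:27:36 2024)
lrm mdc024 (idle, Tue Feb 20 11:27:36 2024)
"""

# Assumed layout: "<role> <node> (<state>, <timestamp>)"
LINE_RE = re.compile(r"^(master|lrm)\s+(\S+)\s+\(([^,]+),")


def non_idle_entries(status_text):
    """Return (role, node, state) for every master/lrm line that is not idle."""
    result = []
    for line in status_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(3).strip() != "idle":
            result.append((match.group(1), match.group(2), match.group(3).strip()))
    return result


if __name__ == "__main__":
    leftovers = non_idle_entries(SAMPLE_STATUS)
    for role, node, state in leftovers:
        print(f"{role} on {node} is {state}")
    if not leftovers:
        print("everything is idle")
```

Run against the output posted above, it flags only the master on mdc-023, which is the node suggested for a reboot.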

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
[2] https://forum.proxmox.com/threads/getting-rid-of-watchdog-emergency-node-reboot.136789/#post-635602

manzke89 · Feb 20, 2024

I understand, thank you for your attention to my problem
