watchdog - pve-ha-lrm fails after every reboot

cjdnad

Member
Jul 21, 2022
In my 3-node cluster with HA I have an annoying issue with the main node.
Whenever it is rebooted, the watchdog and pve-ha-lrm fail to start again, so I have to run modprobe softdog and systemctl start watchdog-mux.service manually every time.
I added watchdog=0 to GRUB, but that made no difference.
Can someone shed any light on this, please?
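
For reference, the manual workaround I run after each reboot is roughly this:

# load the software watchdog module by hand
modprobe softdog
# start the watchdog multiplexer so pve-ha-lrm can run again
systemctl start watchdog-mux.service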

Quorum status and Ceph are all healthy each time.

Thanks
 
Can you provide the dmesg and the journal since the last (failed) boot (journalctl -b > journal.txt)?
Please also provide the output of pveversion -v.
 
I just commented out the WATCHDOG_MODULE line in /etc/default/pve-ha-manager, as it was uncommented before.

So now it is as follows:
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog

After rebooting, the watchdog now starts up on its own successfully!
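
For anyone else hitting this, a quick way to double-check after a reboot (service names as on a default PVE install):

# confirm the softdog module was loaded automatically
lsmod | grep softdog
# confirm the watchdog multiplexer and the HA local resource manager are running
systemctl status watchdog-mux.service pve-ha-lrm.service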


root@pve:~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.1-1
proxmox-backup-file-restore: 2.3.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 
Glad you've resolved it!
So you had the ipmi watchdog configured? Do you even have one available?
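If you're not sure, you can check whether the hardware actually exposes an IPMI watchdog with something like:

# look for IPMI/watchdog devices in the kernel log and under /dev
dmesg | grep -iE 'ipmi|watchdog'
ls -l /dev/ipmi* /dev/watchdog* 2>/dev/null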
 
Glad you've resolved it!
So you had the ipmi watchdog configured? Do you even have one available?
No, that was my mistake. I'll do some further testing to make sure everything is working as expected on the other nodes as well.
 
I'm hitting another obstacle after attempting to manually migrate a CT.
I can SSH to both nodes from the shell, but migration fails with a public key error:
ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
 
Make sure the public key of the source node is in the `authorized_keys` on the target host.
You can update the list of `authorized_keys` with `pvecm updatecerts`. Run it on all nodes in the cluster.
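For example (run on each node; <target-node> is just a placeholder for the node you migrate to):

# regenerate and distribute the cluster SSH keys and certificates
pvecm updatecerts
# then verify that key-based SSH from the source node works non-interactively
ssh -o BatchMode=yes root@<target-node> /bin/true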
 
Make sure the public key of the source node is in the `authorized_keys` on the target host.
You can update the list of `authorized_keys` with `pvecm updatecerts`. Run it on all nodes in the cluster.
I did an update and upgrade first, before I saw your message, then ran this command on all nodes. My mistake before may have been running it on only one node.
Another great solution! Migration works!


Many thanks
 
I am testing the whole cluster with some HDDs as the Ceph OSD disks. There is no 10 Gbit network either, only a local 1 Gbit switch.
It works, albeit a bit slowly.
Would using SSDs for the Ceph pool help speed up migrations, or is the network the bottleneck?
Thanks
 
How many HDDs do you have per node?
 
How many HDDs do you have per node?
Just one HDD per node at the moment (apart from the disk for local storage).
Perhaps add another SSD to each node? I only need 120 GB or 256 GB disks for my environment, so they are fairly cheap.
 
You always want to use datacenter-grade SSDs and HDDs for Ceph.

Performance of a three-node cluster with only a single HDD per node will not be great. You'd be better off just using the HDDs for local storage.
Ceph scales well with additional nodes and additional OSDs. Please take a look at our benchmark paper for recommendations: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
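If you want to see whether the disks or the 1 Gbit network is the limiting factor, you can benchmark them separately, for example (assuming iperf3 is installed; <other-node-ip> and <test-pool> are placeholders, and the pool should be an existing throwaway pool):

# raw network throughput between two nodes (run 'iperf3 -s' on the other node first)
iperf3 -c <other-node-ip>
# raw write performance of a Ceph pool: 60 seconds of 4 MiB objects
rados bench -p <test-pool> 60 write --no-cleanup
rados -p <test-pool> cleanup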
Thanks. I'm not really fussed about having a top-notch production system. It just needs to do the job, providing redundancy for my Home Assistant/MQTT/nginx etc., all lightweight LXC containers.
I'll chuck in three more SSDs; they are only 18 USD a pop and can be replaced cheaply.
I am looking at getting some 10G NICs though.
These came to mind:

https://www.servermonkey.com/hp-561...pPhaXCX7SRQuH91Bvv9zIa0ckxxpUcgsRzdHPWc-Y70xE
 
I've made some improvements and everything seems to be working well. I'm using the two SSDs as a cache with rules: the HDDs form a cold pool and the SSDs a hot pool. Latency appears far better with these rules, and migrations are much faster.
I know four nodes is not ideal (I will get a fifth eventually); the fourth is just a backup node with no vote.
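
For anyone wanting to do something similar, splitting pools by device class can be set up roughly like this; the rule and pool names are placeholders, and this is only one way of getting an SSD "hot" / HDD "cold" split:

# one replicated CRUSH rule per device class
ceph osd crush rule create-replicated rule-ssd default host ssd
ceph osd crush rule create-replicated rule-hdd default host hdd
# point the pools at the matching rule
ceph osd pool set hot-pool crush_rule rule-ssd
ceph osd pool set cold-pool crush_rule rule-hdd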

 
