watchdog - pve-ha-lrm fails after every reboot

cjdnad

Member
Jul 21, 2022
In my 3-node cluster with HA I have an annoying issue with the main node.
Whenever it is rebooted, the watchdog and pve-ha-lrm fail to start again, so I have to run modprobe softdog and systemctl start watchdog-mux.service manually every time.
I added watchdog=0 to GRUB, but that made no difference.
Can someone shed any light on this, please?
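
For reference, the manual workaround I run after each reboot is roughly this:

# load the software watchdog module by hand
modprobe softdog
# start the watchdog multiplexer so pve-ha-lrm can run again
systemctl start watchdog-mux.service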

Quorum status and Ceph are all healthy each time.

Thanks
 
Can you provide the dmesg and the journal since the last (failed) boot (journalctl -b > journal.txt)?
Please also provide the output of pveversion -v.
 
I just commented out the WATCHDOG_MODULE line in /etc/default/pve-ha-manager, as it was uncommented before.

So now it is as follows:
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog

After rebooting, the watchdog now starts up on its own successfully!
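
For anyone else hitting this, a quick way to double-check after a reboot (service names as on a default PVE install):

# confirm the softdog module was loaded automatically
lsmod | grep softdog
# confirm the watchdog multiplexer and the HA local resource manager are running
systemctl status watchdog-mux.service pve-ha-lrm.service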


root@pve:~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.1-1
proxmox-backup-file-restore: 2.3.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 
Glad you've resolved it!
So you had the ipmi watchdog configured? Do you even have one available?
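If you're not sure, you can check whether the hardware actually exposes an IPMI watchdog with something like:

# look for IPMI/watchdog devices in the kernel log and under /dev
dmesg | grep -iE 'ipmi|watchdog'
ls -l /dev/ipmi* /dev/watchdog* 2>/dev/null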
 
Glad you've resolved it!
So you had the ipmi watchdog configured? Do you even have one available?
No, that was my mistake. I'll do some further testing to make sure everything is working as expected on the other nodes as well.
 
I'm hitting another obstacle after attempting to manually migrate a CT.
I can SSH to both nodes from the shell, but migration fails with a public key error:
ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
 
Make sure the public key of the source node is in the `authorized_keys` on the target host.
You can update the list of `authorized_keys` with `pvecm updatecerts`. Run it on all nodes in the cluster.
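For example (run on each node; <target-node> is just a placeholder for the node you migrate to):

# regenerate and distribute the cluster SSH keys and certificates
pvecm updatecerts
# then verify that key-based SSH from the source node works non-interactively
ssh -o BatchMode=yes root@<target-node> /bin/true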
 
Make sure the public key of the source node is in the `authorized_keys` on the target host.
You can update the list of `authorized_keys` with `pvecm updatecerts`. Run it on all nodes in the cluster.
I did an update and upgrade first, before I saw your message, then ran this command on all nodes. My mistake before may have been running it on only one node.
Another great solution! Migration works!


Many thanks
 
I am testing the whole cluster with some HDDs as the Ceph OSD disks. There is no 10 Gbit network either, only a local 1 Gbit switch.
It works, albeit a bit slowly.
Would using SSDs for the Ceph pool help speed up migrations, or is the network the bottleneck?
Thanks
 
How many HDDs do you have per node?
 
How many HDDs do you have per node?
Just one HDD per node at the moment (apart from the disk for local storage).
Perhaps add another SSD to each node? I only need 120 GB or 256 GB disks for my environment, so they are fairly cheap.
 
You always want to use datacenter-grade SSDs and HDDs for Ceph.

Performance of a three-node cluster with only a single HDD per node will not be great. You'd be better off just using the HDDs for local storage.
Ceph scales well with additional nodes and additional OSDs. Please take a look at our benchmark paper for recommendations: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
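If you want to see whether the disks or the 1 Gbit network is the limiting factor, you can benchmark them separately, for example (assuming iperf3 is installed; <other-node-ip> and <test-pool> are placeholders, and the pool should be an existing throwaway pool):

# raw network throughput between two nodes (run 'iperf3 -s' on the other node first)
iperf3 -c <other-node-ip>
# raw write performance of a Ceph pool: 60 seconds of 4 MiB objects
rados bench -p <test-pool> 60 write --no-cleanup
rados -p <test-pool> cleanup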
Thanks. I'm not really fussed about having a top-notch production system. It just needs to do the job, providing redundancy for my Home Assistant/MQTT/nginx etc., all lightweight LXC containers.
I'll chuck in three more SSDs; they are only 18 USD a pop and can be replaced cheaply.
I am looking at getting some 10G NICs though.
These came to mind:

https://www.servermonkey.com/hp-561...pPhaXCX7SRQuH91Bvv9zIa0ckxxpUcgsRzdHPWc-Y70xE
 
I've made some improvements and everything seems to be working well. I'm using the two SSDs as a cache with rules: the HDDs form a cold pool and the SSDs a hot pool. Latency appears far better with these rules, and migrations are much faster.
I know four nodes is not ideal (I will get a fifth eventually); the fourth is just a backup node with no vote.
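
For anyone wanting to do something similar, splitting pools by device class can be set up roughly like this; the rule and pool names are placeholders, and this is only one way of getting an SSD "hot" / HDD "cold" split:

# one replicated CRUSH rule per device class
ceph osd crush rule create-replicated rule-ssd default host ssd
ceph osd crush rule create-replicated rule-hdd default host hdd
# point the pools at the matching rule
ceph osd pool set hot-pool crush_rule rule-ssd
ceph osd pool set cold-pool crush_rule rule-hdd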

 
