Fingerprint error

Hello,

we have two Proxmox Mail Gateways that are clustered together. At regular intervals (every 9-11 weeks) the node throws a fingerprint error. Fixing the error is not a problem, but the frequency puzzles me. We have no subscription on the node, only on the master; both use Let's Encrypt certificates. The hardware is identical on both.

What could be causing this?
 
Are you using the integrated Let's Encrypt/ACME feature? If not, this is expected, see:

https://pmg.proxmox.com/pmg-docs/pmgconfig.1.html

Change Certificate for Cluster Setups

If you change the API certificate of an active cluster node manually, you also need to update the pinned fingerprint inside the cluster configuration.
You can do that by executing the following command on the host where the certificate changed:
pmgcm update-fingerprints
Note, this will be done automatically if using the integrated ACME (for example, through Let’s Encrypt) feature.
 
Yes, we use the ACME feature integrated in the Mail Gateway, with the DNS challenge.
 
OK, which version of PMG are you running? (pmgversion -v)
And can you post the journal from the time when it happened (ideally from when the certificate was replaced via ACME)?
 
Same output on master and node:
Code:
proxmox-mailgateway: 7.3-1 (API: 7.3-3/a3d66da0, running kernel: 5.15.104-1-pve)
pmg-api: 7.3-3
pmg-gui: 3.3-2
pve-kernel-5.15: 7.4-1
pve-kernel-helper: 7.3-8
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
clamav-daemon: 0.103.8+dfsg-0+deb11u1
ifupdown2: 3.1.0-1+pmx3
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.4
libproxmox-acme-plugins: 1.4.4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-http-server-perl: 4.2-3
libxdgmime-perl: 1.0-1
lvm2: 2.03.11-2.1
pmg-docs: 7.3-2
pmg-i18n: 2.12-1
pmg-log-tracker: 2.3.2-1
postgresql-13: 13.9-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-spamassassin: 4.0.0-2
proxmox-widget-toolkit: 3.6.5
pve-firmware: 3.6-4
pve-xtermjs: 4.16.0-1
zfsutils-linux: 2.1.9-pve1

And the journal (anonymized):
Code:
Loading ACME account details
Placing ACME order
Order URL: https://acme-v02.api.letsencrypt.org/acme/order/328739300/103301168426

Getting authorization details from 'https://acme-v02.api.letsencrypt.org/acme/authz-v3/126351415636'
The validation for xxxxx.yyyyyyyyyyy.zz is pending!
[Wed May  3 02:47:49 CEST 2023] Adding record
[Wed May  3 02:47:49 CEST 2023] Record added, OK
Add TXT record: _acme-challenge.xxxxx.yyyyyyyyyyy.zz
Sleeping 30 seconds to wait for TXT record propagation
Triggering validation
Sleeping for 5 seconds
Status is 'valid', domain 'xxxxx.yyyyyyyyyyy.zz' OK!
[Wed May  3 02:48:40 CEST 2023] Record deleted
Remove TXT record: _acme-challenge.xxxxx.yyyyyyyyyyy.zz

All domains validated!

Creating CSR
Checking order status
Order is ready, finalizing order
valid!

Downloading certificate
Setting custom certificate file /etc/pmg/pmg-api.pem
Restarting pmgproxy
Notify cluster about new fingerprint

TASK ERROR: 500 update fingerprints failed: unable to get remote node fingerprint from 'xxxxx': command 'ssh -l root -o 'BatchMode=yes' -o 'HostKeyAlias=xxxxx' 111.222.333.444 'openssl x509 -noout -fingerprint -sha256 -in /etc/pmg/pmg-api.pem'' failed: exit code 255
 
The file /etc/pmg/cluster.conf has owner and group root and permissions 644.
 
Which node (master/slave?) is the task log from?

Edit: and what is the output when you run this command manually on the node?
Code:
openssl x509 -noout -fingerprint -sha256 -in /etc/pmg/pmg-api.pem
 
That one is from the node (slave).

This one is from the master:
Code:
Loading ACME account details
Placing ACME order
Order URL: https://acme-v02.api.letsencrypt.org/acme/order/328739300/171746259677

Getting authorization details from 'https://acme-v02.api.letsencrypt.org/acme/authz-v3/213246862707'
The validation for vvvvvv.yyyyyyyyyyy.zz is pending!
[Thu Mar 23 05:47:26 CET 2023] Adding record
[Thu Mar 23 05:47:26 CET 2023] Record added, OK
Add TXT record: _acme-challenge.vvvvvv.yyyyyyyyyyy.zz
Sleeping 30 seconds to wait for TXT record propagation
Triggering validation
Sleeping for 5 seconds
Status is 'valid', domain 'vvvvvv.yyyyyyyyyyy.zz' OK!
[Thu Mar 23 05:48:08 CET 2023] Record deleted
Remove TXT record: _acme-challenge.vvvvvv.yyyyyyyyyyy.zz

All domains validated!

Creating CSR
Checking order status
Order is ready, finalizing order
valid!

Downloading certificate
Setting custom certificate file /etc/pmg/pmg-api.pem
Restarting pmgproxy
Notify cluster about new fingerprint

TASK ERROR: 500 update fingerprints failed: unable to get remote node fingerprint from 'xxxxxxx': command 'ssh -l root -o 'BatchMode=yes' -o 'HostKeyAlias=pmg2' 111.222.333.444 'openssl x509 -noout -fingerprint -sha256 -in /etc/pmg/pmg-api.pem'' failed: exit code 255
 
We also see the error when trying to avoid the cluster losing sync with members.

Bash:
root@pmg-node1:~# pmgcm update-fingerprints
500 update fingerprints failed: unable to get remote node fingerprint from 'pmg-node2': command 'ssh -l root -o 'BatchMode=yes' -o 'HostKeyAlias=pmg-node2' 111.222.333.444 'openssl x509 -noout -fingerprint -sha256 -in /etc/pmg/pmg-api.pem'' failed: exit code 255
root@pmg-node1:~#

If they can communicate when fingerprints are manually updated, why the error?
 
If they can communicate when fingerprints are manually updated, why the error?
the error is from `ssh` - so the issue might have nothing to do with the api-proxy server...

does:
`ssh -l root -o 'BatchMode=yes' -o 'HostKeyAlias=pmg-node2' 111.222.333.444`
work when run manually on the CLI?
 
work when run manually on the CLI?
Host key verification failed.

So more than fingerprints need updating in order to retain connectivity?

What needs updating? And will leveraging update-fingerprints on-cron then be sufficient to avoid comms breakage?

Otherwise it tends to lose connection to other members every month or two, and we have to get the pmg-api.pem SHA and update cluster.conf etc
 
PMG's cluster stack uses ssh and the api for communication - thus both need to be updated.
However - the ssh-host-keys usually do not change once set up....

The question is why did your host-key change?

apart from that running `update-fingerprints` should be enough (it synchronizes the API-certificate fingerprints)
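
Since the cluster stack relies on both ssh and the API, it can help to first establish which of the two is failing. The following is only a rough sketch, not part of PMG: `classify_ssh_error` is a hypothetical helper that looks at the text ssh printed on stderr and sorts it into a few coarse categories. The matched strings are common OpenSSH messages; other wordings fall through to "unknown".

```shell
#!/bin/sh
# Hypothetical helper (not a PMG tool): classify an ssh failure by the
# message ssh wrote to stderr. ssh itself exits with 255 on any of these.
classify_ssh_error() {
    case "$1" in
        *"Host key verification failed"*)
            echo "host-key" ;;   # known_hosts entry no longer matches the remote host key
        *"Permission denied"*)
            echo "auth" ;;       # root's public key was not accepted
        *"Connection refused"*|*"Connection timed out"*|*"No route to host"*)
            echo "network" ;;    # sshd not reachable at all
        *)
            echo "unknown" ;;
    esac
}

# Example use (commented out so the sketch has no side effects):
#   err=$(ssh -l root -o BatchMode=yes -o HostKeyAlias=pmg-node2 111.222.333.444 true 2>&1 >/dev/null) \
#       || classify_ssh_error "$err"
```

If the result is "host-key", the problem lies with the ssh host keys rather than with the API certificate fingerprints. OpenSSH's `ssh-keygen -R pmg-node2` removes a stale entry from the invoking user's known_hosts file; whether that is the only file consulted in a given setup is an assumption here.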
 
The question is why did your host-key change?
Either cloud-init or provisioning scripts not being fully disabled/deleted. Will poke around to figure out why it regenerated.

apart from that running `update-fingerprints` should be enough (it synchronizes the API-certificate fingerprints)
Thanks for that. Will sort the extra root cause then focus back on getting this working well. :)

Should it need cron-jobbing, or should the SSL/API fingerprints auto-update in each cluster.conf?

UPDATE:

Some servers had a provisioning script enabled - now fixed. But that doesn't apply to the filters...

With these, it seems there's no cloud-init drive nor provisioning scripts, so I'm as yet unsure re: why.
 
Should it need cron-jobbing, or should the SSL/API fingerprints auto-update in each cluster.conf?
Not sure I understand the question...
The fingerprints need updating when the API SSL certificate changes on a node in a cluster.
This happens when you change the certificate of this node.
If you use PMG's ACME client, then update-fingerprints is called automatically upon certificate renewal.
In all other cases you need to call it after changing the certificate.
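
To check whether a node's current API certificate is still the one pinned in the cluster configuration, something like the following sketch could be used. It is an illustration under assumptions, not PMG's own logic: `fingerprint_value` and `is_pinned` are hypothetical helpers, and the sketch assumes that /etc/pmg/cluster.conf stores each pinned value on a line of the form `fingerprint <value>`.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PMG).

# Reduce a line such as "sha256 Fingerprint=AA:BB:..." (as printed by
# `openssl x509 -noout -fingerprint -sha256`) to the bare, upper-case value.
fingerprint_value() {
    printf '%s\n' "$1" | sed 's/^[^=]*=//' | tr 'a-f' 'A-F'
}

# Succeed if fingerprint $1 appears in file $2 on a "fingerprint <value>" line.
# ASSUMPTION: this line format matches the cluster.conf layout.
is_pinned() {
    awk -v fp="$1" '
        $1 == "fingerprint" && toupper($2) == toupper(fp) { found = 1 }
        END { exit !found }
    ' "$2"
}

# Example use (commented out so the sketch has no side effects):
#   line=$(openssl x509 -noout -fingerprint -sha256 -in /etc/pmg/pmg-api.pem)
#   is_pinned "$(fingerprint_value "$line")" /etc/pmg/cluster.conf \
#       || echo "certificate changed, run: pmgcm update-fingerprints"
```

With the integrated ACME client this check should not normally be needed, since the update is triggered on renewal; it is mainly useful after replacing a certificate by hand.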
 
If you use PMG's ACME client then the update-fingerprints is called upon certificate renewal automatically
Thank you for this, I think we're good then.

The update-fingerprints run went through OK on both hosts after clearing known hosts and then opening a connection from each host first.
 
The question is why did your host-key change?
I think your PVE 8.0 changelog may have the fix @Stoiko Ivanov - as we are on 7.4 when experiencing this:

- cloud-init: If the VM name is not a FQDN and no DNS search domain is configured, the automatically-generated cloud-init user data now contains an additional fqdn option. This fixes an issue where the hostname was not set properly for some in-guest distributions. However, the changed user data will change the instance ID, which may cause the in-guest cloud-init to re-run actions that trigger once-per-instance. For example, it may regenerate the in-guest SSH host keys.
 
