What are the best practices for Certificate generation?

TimRyan

Member
Aug 24, 2022
Kaslo BC Canada
comxpertise.ca
After reading the documentation at https://pve.proxmox.com/wiki/Certificate_Management, and after running into certificate problems when adding or removing nodes on an existing cluster, I need to understand why the standard cluster-internal certificate handling is the default and why externally issued and renewed ACME certificates are not.

In my experience, when a cluster is built from scratch with all the hardware on the same local subnet, it works very well and is very stable. The problems I have encountered have all been related to adding node hardware after the initial cluster is defined, and to replacing existing nodes with new hardware under an existing node name. This leads me to ask these questions:

Is there a compelling reason that an external certificate authority such as Let's Encrypt is not used by default?
Are there cluster-wide commands that can easily regenerate a damaged set of certificate files for an entire cluster or a federated group of clusters?
 
Hi,
I am not sure I fully understand the question, but Proxmox VE uses a self-signed certificate (signed by a certificate authority on the cluster) to provide TLS-encrypted communication via the API server, as described in the wiki article you linked to. This is used by default, as it can be set up without further user interaction.

Using an external certificate authority to generate certificates requires additional setup of the corresponding ACME account and plugins, but this can easily be done via the WebUI. The certificates managed by ACME are then renewed automatically by the pve-daily-update.service systemd service.

You can also list the node's certificate information via pvenode cert info.
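
As a rough illustration of how that command could feed a monitoring check, here is a minimal Python sketch. It assumes that `pvenode cert info --output-format json` returns a list of objects with `filename` and `notafter` (Unix epoch seconds) fields; both the field names and the helper names `load_cert_info` and `expiring_soon` are assumptions for this example, so check the output on your own node first.

```python
import json
import subprocess
import time


def load_cert_info():
    """Run `pvenode cert info` on a Proxmox VE node and return the parsed JSON."""
    out = subprocess.run(
        ["pvenode", "cert", "info", "--output-format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def expiring_soon(certs, days=30, now=None):
    """Return (filename, days_left) for certificates expiring within `days`."""
    now = time.time() if now is None else now
    flagged = []
    for cert in certs:
        not_after = cert.get("notafter")  # assumed field: Unix epoch seconds
        if not_after is None:
            continue  # skip entries without an expiry timestamp
        days_left = int((not_after - now) // 86400)
        if days_left < days:
            flagged.append((cert.get("filename", "?"), days_left))
    return flagged
```

On a node, `expiring_soon(load_cert_info())` would list the certificates with fewer than 30 days left; already expired certificates show up with a negative day count.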

The problems I have encountered have all been related to adding node hardware after the initial cluster is defined and replacing existing nodes with new hardware and an existing node name. This leads me to ask these questions;
What issues did you encounter exactly? Were these related to the certificates or rather to the corosync setup?
 
Hi Chris
I have read the documentation on multiple occasions while trying to resolve problems with what appears to me to be:

1. A patchy treatment of the internal certificate process, which can be broken and can be difficult to fully repair.
2. A less than well documented practice for the use and deployment of ACME certs.
3. Whether and how ACME certs and self-generated certs should be mixed, or whether only one type should be deployed.

Let me state that I have spent much of the last two years researching cluster platforms for edge cloud construction. There is a lot to like in what has been achieved with Proxmox. However, for the record, as the digital systems landscape is built going forward, secure, stable, reliable and extensible certification will be a critical feature, not only for the internal operational reliability of the clusters, but also for cross-site integration and for communications with users accessing cluster-hosted applications.

I started my evaluation of Proxmox on an early version of PVE 7 and had a cluster under way when PVE 8 arrived, so I was adding servers to the cluster and upgrading as I went. The transition from 7 to 8 was not painless, and while adding and removing servers I managed to break the certification system. This was not a complete failure, but a loss of reliability in some functions. Case in point: of my 8 nodes, several of the newer ones that were built on PVE 8 and had Ubuntu 22.04.3 VMs on them were unable to connect to the VNC console because the host node failed to recognize the VM certs. Rebuilt nodes also displayed cert failures, not always, and often for reasons that were not obvious.

In my opinion, if a node has been certified and is being "managed" by the cluster host, then it should be up to the cluster host to resolve any and all cert issues transparently at the OS level, by way of the web API, for all of its attached VMs.

This also needs to be addressed systemically in a distributed cluster environment. The next project in my Proxmox evaluation is a set of physically separated but fiber-linked host sites. Each site is a single 42U rack with power support and rack servers running the latest PVE 8 code. The sites are connected on a site LAN to a Layer 3 fiber switch, with a distributed DHCP relay at the router level maintaining the internal cluster LANs on two test sites. External internet IPs are assigned by the network operators, with reverse proxy routing of HTTPS services and static IP routing for VMs hosting NAS, VPN access, and SMTP services.

High reliability is a key issue and certificate failures must be resolved in real time and reported via SNMP to a systemic control system.

Where I sit at the moment is that the certs issue is the last critical weak point that I need to resolve to my satisfaction before I commit to this platform. There's lots to like, but there are several points of failure that are not obvious until they fail.
 
Node certificates are handled in a rather straightforward way. The self-signed certificates generated by the node itself are located under /etc/pve/nodes/NODENAME/pve-ssl.pem and are used by default if neither an ACME certificate nor a custom certificate has been deployed (either of which is written to /etc/pve/local/pveproxy-ssl.pem).
When the API server (provided by the pveproxy service) is loaded, the corresponding certificate files are checked and used to authenticate the server.
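
The selection order described above can be sketched in a few lines of Python. This is only an illustration of the fallback logic as stated in this post, not the actual pveproxy implementation; the function name `active_certificate` and the `pve_root` parameter are made up for the example.

```python
from pathlib import Path


def active_certificate(node, pve_root="/etc/pve"):
    """Return the certificate file the API server is expected to use, or None.

    Assumed order (as described in the post): a deployed ACME or custom
    certificate wins, otherwise the node's self-signed certificate is used.
    """
    root = Path(pve_root)
    custom = root / "local" / "pveproxy-ssl.pem"       # ACME or custom cert
    default = root / "nodes" / node / "pve-ssl.pem"    # self-signed default
    if custom.is_file():
        return custom
    if default.is_file():
        return default
    return None  # neither file is present, which would itself be a problem
```

Running `active_certificate("NODENAME")` on a node shows which of the two files is in play, which is a useful first step before checking whether pveproxy was restarted after a change.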

Without knowing your exact case, I would suppose that either the certificate was not installed correctly or the pveproxy service was not reloaded after the certificate was deployed, which probably led to your issues. As these files are all located on the Proxmox cluster filesystem, changes to the cluster and loss of quorum might also have an impact.

Were all of your nodes part of the cluster when you encountered the issues? Was the pveproxy service restarted after joining the node to the cluster?
 
Hi Chris

The PVE 8.1.4 environment I have today was built initially with 5 nodes on PVE 7. As I was building a cluster and testing hardware at the same time, I encountered issues with certs and permissions throughout the process, mostly triggered by adding and removing nodes, and also by the transition from PVE 7 to 8. It's all sorted out now, but only after a lot of attention to the internal cert process, which I had not looked at closely before. Here's a suggestion, though, that might benefit everybody.

The host node, which is generally the very first node in any cluster, winds up being the keeper of certs and consistency for the cluster. This is as it should be, but on any other node it doesn't take much to mess up the certs. There should be a cron-driven cert check, run right after the midnight update sequence, that tests for and reports any inconsistencies or unexpected changes so they can be dealt with before failed or corrupt certs cause problems.
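
To make the suggestion concrete, here is a minimal sketch of what such a check could look like in Python. It is not an existing Proxmox feature: the names `fingerprint` and `compare_to_baseline` are hypothetical, and the baseline is assumed to be a plain dict of path to SHA-256 fingerprint that was recorded while the cluster was known to be healthy.

```python
import base64
import hashlib
import re
from pathlib import Path

# Matches the first certificate block in a PEM file (chains may hold several).
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.+?)-----END CERTIFICATE-----", re.S
)


def fingerprint(pem_text):
    """SHA-256 fingerprint (hex) of the first certificate in a PEM string."""
    match = PEM_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no certificate block found")
    der = base64.b64decode("".join(match.group(1).split()))
    return hashlib.sha256(der).hexdigest()


def compare_to_baseline(cert_paths, baseline):
    """Return {path: reason} for certs that are unreadable, unknown, or changed."""
    problems = {}
    for path in cert_paths:
        try:
            current = fingerprint(Path(path).read_text())
        except (OSError, ValueError) as exc:
            problems[str(path)] = f"unreadable: {exc}"
            continue
        expected = baseline.get(str(path))
        if expected is None:
            problems[str(path)] = "not in baseline"
        elif expected != current:
            problems[str(path)] = "fingerprint changed"
    return problems
```

A script like this could be run from cron and its non-empty result mailed or forwarded to monitoring. Note that a legitimate renewal also changes the fingerprint, so the baseline would need to be refreshed after each planned renewal.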

Lastly, thanks for the attention that Proxmox staff pay to the forum. For the record, after two years of chasing, this is the platform for me.
 
Thank you for the improvement suggestions. These are best placed at https://bugzilla.proxmox.com/ in order to better keep track of them and to get email notifications on updates and progress.

Regarding the handling of the certificates, I am still not sure I fully grasp what the exact issue is and what the proposed solution would check and improve. This could be discussed in more detail in an issue on the bug tracker.

I think there is a misconception about the current implementation details. Note that in a Proxmox VE cluster all nodes are equal with respect to voting and sharing state via the underlying Proxmox Cluster Filesystem (which is shared in the cluster via corosync to get a consistent replicated state). As the certificates reside on this shared filesystem, all nodes in the (quorate) part of the cluster will therefore see the same state.
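
For readers unfamiliar with the term "quorate", the following small Python sketch shows the majority rule behind it. It assumes one vote per node and a plain majority requirement (no special corosync options); the function names are made up for this example.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(votes_present, total_votes):
    """True if the reachable partition holds a strict majority of all votes."""
    return votes_present >= quorum_threshold(total_votes)
```

With the 8 nodes mentioned earlier in this thread, `quorum_threshold(8)` is 5, so a 4/4 split leaves neither half quorate. A non-quorate node cannot write to the cluster filesystem, which is one way that adding or removing nodes at the wrong moment could interfere with certificate files.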
 
Case in point: of my 8 nodes, several of the newer ones that were built on PVE 8 and had Ubuntu 22.04.3 VMs on them were unable to connect to the VNC console because the host node failed to recognize the VM certs. Rebuilt nodes also displayed cert failures, not always, and often for reasons that were not obvious.

Are you sure this was not related to SSH keys rather than certificates? For VNC specifically, the console connection (relaying) does not, as of today, have much to do with SSL.
 
