Unable to login via Web GUI; SSH works

since the vzdump command doesn't work right now,
vzdump doesn't depend on pveproxy; it should work just fine, just from the CLI. You can also stop a VM using qm stop, and then copy the disk image or use qemu-img to back it up if you prefer. Your virtualization platform is still functioning.
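
As a rough sketch of those two options from the shell: the VMID (100), the dump directory (/mnt/backup) and the disk paths below are placeholders, and the run helper and function names are made up for this example. By default it only prints the commands, so nothing is touched until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. VMID, dump directory and disk paths are assumptions;
# substitute the values from your own node.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Option 1: call vzdump directly from the CLI.
# $1 = VMID, $2 = directory to write the dump to
backup_with_vzdump() {
    run vzdump "$1" --mode snapshot --compress zstd --dumpdir "$2"
}

# Option 2: stop the VM, then copy its disk with qemu-img.
# $1 = VMID, $2 = source disk path, $3 = destination image
backup_with_qemu_img() {
    run qm stop "$1"
    run qemu-img convert -O qcow2 "$2" "$3"
}

# Example (prints the commands only):
#   backup_with_vzdump 100 /mnt/backup
#   backup_with_qemu_img 100 /path/to/vm-100-disk /mnt/backup/vm-100-disk.qcow2
```

The source disk path depends on the storage type, so check the VM config with qm config before picking one.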

My understanding of clustering and HA in general is that it is really meant for
Your understanding is correct.

each site has only 1 server dedicated to it.
Well, herein is the rub: you run production without a safety net. Luckily for you, this issue is not catastrophic, as your hypervisor is still providing services, but what happens when the server fails in a more meaningful way? How much productivity or revenue will be lost per minute of outage? This may be a good moment of clarity to evaluate whether the solution put in production is adequate to the need.
 
vzdump doesn't depend on pveproxy; it should work just fine, just from the CLI. You can also stop a VM using qm stop, and then copy the disk image or use qemu-img to back it up if you prefer. Your virtualization platform is still functioning.

How is it going to run without pvestatd?

Well, herein is the rub: you run production without a safety net. Luckily for you, this issue is not catastrophic, as your hypervisor is still providing services, but what happens when the server fails in a more meaningful way? How much productivity or revenue will be lost per minute of outage? This may be a good moment of clarity to evaluate whether the solution put in production is adequate to the need.

The rub is not applicable here. Even if the OP had, e.g., 3 nodes at any single site and had configured them all the same way, he would now be toast, except we would also be wondering whether it was cluster related. He would need to "recover" (if that's what you call starting over) from backups with downtime either way.
 
Dec 11 is the time you first installed this node and set up the AD as well? Was everything in terms of PVE running for you after the AD setup? Or you cannot confirm?

Btw you can start the sssd service again now.
Yes, that's when it was installed and everything was running well for about a week. I stopped logging in after that until now.

I still have the ISO I used to install Proxmox in the first place and I decided to do a test. I took a spare server and installed Proxmox. I then did an apt update && apt upgrade -y and all hell broke loose, throwing similar errors to what I have on the production servers. I rebooted it and it says:

Loading Linux 6.5.11-8-pve ...
error: file 'vmlinuz-6.5.11-8-pve' not found.
Loading initial ramdisk ...
error: you need to load the kernel first.

That confirms that using apt upgrade caused the issue and makes me fear a power outage. I'm glad I caught this before servers started actually dying.
 
Code:
root@STP-HV:~# ls -1 /etc/pam*
/etc/pam.conf


/etc/pam.d:
chfn
chpasswd
chsh
common-account
common-auth
common-password
common-session
common-session-noninteractive
cron
login
newusers
other
passwd
runuser
runuser-l
samba
sshd
sssd-shadowutils
su
su-l

Can you show the content of /etc/nsswitch.conf (redacted as needed)?
 
But are you saying that if you now - additionally - also run dist-upgrade, it will not fix it either?
 
Can you show the content of /etc/nsswitch.conf (redacted as needed)?
Code:
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         files systemd sss
group:          files systemd sss
shadow:         files systemd sss
gshadow:        files systemd

hosts:          files dns
networks:       files

protocols:      db files
services:       db files sss
ethers:         db files
rpc:            db files

netgroup:       nis sss
automount:  sss
 
Could you do a test? Back up this version of the file, then remove the references to sss (and delete the automount line completely).

EDIT: Or since you have the fresh install at hand, just copy paste the default PVE shipped content into here.
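
One way to run that test without hand-editing could look like the sketch below. strip_sss is a name made up for this example, and it assumes the file has the layout shown above (one database per line, sss as a plain source name). It only prints the cleaned version, so the original stays untouched until you redirect the output yourself.

```shell
#!/bin/sh
# Sketch only: prints a copy of an nsswitch.conf with every "sss" source
# removed and the automount line dropped. The input file is not modified.
# $1 = path to the nsswitch.conf to read
strip_sss() {
    sed -E \
        -e '/^automount:/d' \
        -e '/^[a-z]+:/s/[[:space:]]+sss([[:space:]]|$)/\1/g' \
        "$1"
}

# Example, keeping a backup first:
#   cp -a /etc/nsswitch.conf /etc/nsswitch.conf.bak
#   strip_sss /etc/nsswitch.conf.bak > /etc/nsswitch.conf
```

Copying /etc/nsswitch.conf.bak back over the file reverts the test.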
 
Also, I suppose you have never rebooted your production nodes since, which you will eventually need to do. I do not think you messed anything up by running upgrade prior to dist-upgrade, especially as you have not rebooted yet, but you should reboot.

If it would make you feel better, try it on the fresh node first: run apt update, then upgrade, then also dist-upgrade, and then reboot.
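
Given the "vmlinuz not found" error on the test box, a pre-reboot check along these lines might be worth running. It is only a sketch: check_boot_kernels is a hypothetical helper, and it assumes the node boots via GRUB with its config at /boot/grub/grub.cfg. Nodes that boot some other way (for example through proxmox-boot-tool) may keep their kernels elsewhere, in which case this check does not apply.

```shell
#!/bin/sh
# Sketch only: for every "linux <path> ..." line in grub.cfg, report whether
# the referenced kernel file actually exists in the boot directory.
# $1 = boot directory (defaults to /boot)
# Returns 1 if at least one referenced kernel is missing, 0 otherwise.
check_boot_kernels() {
    boot_dir="${1:-/boot}"
    cfg="$boot_dir/grub/grub.cfg"
    missing=0
    for kernel in $(awk '$1 == "linux" { print $2 }' "$cfg" | sort -u); do
        name=$(basename "$kernel")
        if [ -e "$boot_dir/$name" ]; then
            echo "ok: $name"
        else
            echo "MISSING: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_boot_kernels /boot || echo "do not reboot yet"
```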
 
Right, we haven't rebooted them. The issue is that I can't run a dist-upgrade at all, and now if I reboot it won't be able to load the kernel. I'm just going to have to redo all my work and take this as a lesson to RTFM.
 
Right, we haven't rebooted them. The issue is that I can't run a dist-upgrade at all,

What's the result if you run update && dist-upgrade now?

and now if I reboot it won't be able to load the kernel. I'm just going to have to redo all my work and take this as a lesson to RTFM.

In my opinion, this is also stupid on the part of PVE. For one, it has its own pveupdate and pveupgrade scripts that it should push everyone to run, and since its upgrade behaviour is so non-standard, it should patch apt or alias it at the least. There are so many threads here that one would have thought it's a major pain point when they break normal Debian behaviour.
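
To illustrate the "alias at the least" idea, an admin could drop something like this into root's ~/.bashrc. It is purely a hypothetical guard, not anything PVE ships: it only affects interactive shells that define the function, and it does nothing for apt-get or for scripts.

```shell
#!/bin/sh
# Sketch only: shell function that shadows apt and refuses a plain
# "upgrade", passing every other invocation through to the real apt.
apt() {
    for arg in "$@"; do
        if [ "$arg" = "upgrade" ]; then
            echo "Refusing 'apt upgrade' on this host; use 'apt dist-upgrade' instead." >&2
            return 1
        fi
    done
    command apt "$@"
}

# Example:
#   apt upgrade        # refused with a message, returns 1
#   apt dist-upgrade   # handed to the real apt unchanged
```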
 
What's the result if you run update && dist-upgrade now?
Code:
root@STP-HV:~# apt update && dist-upgrade
Get:1 http://download.proxmox.com/debian/pve bookworm InRelease [2,768 B]
Get:2 http://security.debian.org bookworm-security InRelease [48.0 kB]
Get:3 http://security.debian.org bookworm-security/main amd64 Packages [137 kB]
Get:4 http://download.proxmox.com/debian/pve bookworm/pve-no-subscription amd64 Packages [232 kB]
Get:5 http://security.debian.org bookworm-security/main Translation-en [81.4 kB]
Get:6 http://security.debian.org bookworm-security/contrib amd64 Packages [644 B]
Hit:7 http://ftp.us.debian.org/debian bookworm InRelease
Get:8 http://ftp.us.debian.org/debian bookworm-updates InRelease [52.1 kB]
Fetched 554 kB in 1s (547 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
8 packages can be upgraded. Run 'apt list --upgradable' to see them.
-bash: dist-upgrade: command not found
Code:
root@STP-HV:~# apt dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up pve-manager (8.1.4) ...
  LVM configuration valid.
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xeu pvestatd.service" for details.
dpkg: error processing package pve-manager (--configure):
 installed pve-manager package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 pve-manager
E: Sub-process /usr/bin/dpkg returned an error code (1)
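
The output above already names the next things to look at. As a sketch of how the leftover state could be inspected: not_fully_installed is a helper name made up here, and it simply filters dpkg-query output for packages stuck in a half-installed, unpacked, half-configured or trigger state.

```shell
#!/bin/sh
# Sketch only: reads "<status-abbrev> <package>" lines on stdin and prints
# the packages whose state letter (second character) is one of
# H (half-installed), U (unpacked), F (half-configured),
# W (triggers-awaited) or t (triggers-pending).
not_fully_installed() {
    awk 'substr($1, 2, 1) ~ /[HUFWt]/ { print $2 }'
}

# Example on the affected node:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | not_fully_installed
#   systemctl status pvestatd.service
#   journalctl -xeu pvestatd.service
#   dpkg --configure -a    # retry once pvestatd is able to start
```

On this node the filter would be expected to print pve-manager only, matching the "1 not fully installed or removed" line.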
 
Can you afford to take one of the nodes as guinea pigs to clean up and reboot?

What does apt list --upgradable show?
 
I can't, unfortunately. I'm planning to reinstall the first one tomorrow night.

apt list --upgradable is empty...
Code:
root@STP-HV:~# apt list --upgradable
Listing... Done
 
And you cannot even reproduce this limbo state on your fresh install now? To begin with, I would simply apt remove pve-manager and then apt install it again; the config should be left in place. It might even be enough.
 
Maybe I will try that tomorrow night before going nuclear...
 
Ironically, it might be that all you need is a reboot. I had not assumed until now that the nodes were never rebooted. I thought your claim was that apt upgrade was run instead of dist-upgrade. My point was that since dist-upgrade was eventually run, it should not matter.

I understand you want to take your time even with the reboot. :)

If you could reproduce this odd state on a blank node, it would be easier for you. Other than that, take a copy of what's in /etc/pve on the node you are going to rebuild tomorrow; it might save you some extra effort if you need to start from scratch with a new node.
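
A minimal sketch of taking that copy, assuming a tarball under /root is acceptable. backup_config_dir is a hypothetical helper name. Note that /etc/pve is only populated while the pve-cluster service is running, so take the copy before shutting anything down.

```shell
#!/bin/sh
# Sketch only: archive a directory into a gzip-compressed tarball, storing
# paths relative to the directory's parent (so /etc/pve is stored as "pve/").
# $1 = directory to archive, $2 = tarball to create
backup_config_dir() {
    src="$1"
    dest="$2"
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example:
#   backup_config_dir /etc/pve "/root/etc-pve-$(hostname)-$(date +%F).tar.gz"
#   tar -tzf /root/etc-pve-*.tar.gz | head    # quick look at what was saved
```

Copy the tarball off the node as well, since the reinstall will wipe /root.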
 
NB The log:
Code:
 Dec 11 10:31:07 STP-HV ldap_child[833698]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Client 'STP-HV$@[REDACTED DOMAIN]' not found in Kerberos database. Unable to create GSSAPI-encrypted LDAP connection.
... was also not good to see, and it has kept repeating ever since.
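
To get a feel for how often it repeats, something like the following could tally the failures per day. It is a sketch: keytab_failures_per_day is a made-up helper, and it assumes the default journalctl line format shown above (month and day as the first two fields) and that the messages carry the ldap_child identifier seen in the log line.

```shell
#!/bin/sh
# Sketch only: reads journal lines on stdin and prints "<month> <day> <count>"
# for every day with at least one keytab lookup failure.
keytab_failures_per_day() {
    awk '/not found in Kerberos database/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' | sort
}

# Example on the affected node:
#   journalctl -t ldap_child --no-pager | keytab_failures_per_day
# Listing the principals actually stored in the keytab (needs the Kerberos
# client tools installed) can show whether the machine account is there:
#   klist -k /etc/krb5.keytab
```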
 
