GUI Login is not working anymore via AD

We have an issue with the Proxmox GUI login through AD; no other login method was set up by the previous system engineers, and there is zero documentation at all. What we have found out is that 2 nodes in the cluster are degraded, and I think the GUI is running on them. SSH connections to those nodes still work, but we need to log in to the GUI. What is the reason for this issue, and how can we resolve it?

[Screenshot: login failed.png]



The second cluster we have is working fine, but this one has an issue, and we cannot understand why this happened so suddenly.

Thanks and best regards,

Daniel
 
SSH connections to those nodes still work, but we need to log in to the GUI.
If you can get root through the SSH connection, you can set the password for root - then log in to the GUI using root and the PAM realm

Otherwise, to see what might be going on, please share the journal of the node - especially from while you are trying to log in ...
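For example, something along these lines (just a rough sketch; adjust to your setup):

passwd                 # in a root shell over ssh: (re)set the root password
journalctl -f          # leave this running while you try the GUI login as root@pam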
 
[Screenshot: 1697547051782.png]
That is the result of the root login, because the previous employee set up root only for SSH and not for logging in on the GUI. There are no PVE accounts or anything else, only AD users. A restart of the cluster service also did not help, and we found the following messages on vmhost01 and 02:

[Screenshot: 1697547273595.png]
The virtual machines are running, but we can't get into the GUI anymore, and we are getting the following messages with the command journalctl -u corosync.service:
[Screenshot: 1697547408263.png]
 
That is the result of the root login, because the previous employee set up root only for SSH and not for logging in on the GUI.
This sounds odd and would take quite a bit of additional configuration - make sure that the password is correct for root (in a shell as root, type `passwd` to change it) - unless the machine's root account is also backed by LDAP/AD?
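One quick way to check whether root is a local account or comes from LDAP/AD (a sketch, assuming a standard Debian NSS setup):

getent passwd root                  # resolves root through NSS (local files and/or LDAP/sssd)
grep '^passwd' /etc/nsswitch.conf   # shows which sources (files, ldap, sss, ...) are used for user lookups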

Please post the logs as text files instead of as screenshots from PuTTY...
The journal can be viewed with `journalctl --since '12:00'` (just one example)

The knet errors might indicate an issue with IPs not matching the hostnames in /etc/hosts - so make sure this mapping is correct
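For example (a sketch - the hostname is just one from your cluster, check each node):

hostname                 # this node's own name
getent hosts vmhost01    # should resolve to the ring0_addr that corosync.conf expects
cat /etc/hosts           # compare the entries with /etc/corosync/corosync.conf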

I assume you don't have HA active and that the system currently has no quorum? (`ha-manager status` and `pvecm status` would tell you)
If you have no HA - consider restarting corosync
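Roughly (a sketch, and only if you are sure HA is not in use):

ha-manager status          # should list no managed services if HA is unused
pvecm status               # current quorum state
systemctl restart corosync # then re-check pvecm status and the journal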
 
We also have this issue:

Oct 17 15:02:09 vmhost01 pvescheduler[2749551]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Oct 17 15:03:09 vmhost01 pvescheduler[2750290]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Oct 17 15:03:09 vmhost01 pvescheduler[2750289]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Oct 17 15:04:09 vmhost01 pvescheduler[2750962]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Oct 17 15:04:09 vmhost01 pvescheduler[2750961]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
 
This sounds odd and would take quite a bit of additional configuration - make sure that the password is correct for root (in a shell as root, type `passwd` to change it) - unless the machine's root account is also backed by LDAP/AD?

Please post the logs as text files instead of as screenshots from PuTTY...
The journal can be viewed with `journalctl --since '12:00'` (just one example)

The knet errors might indicate an issue with IPs not matching the hostnames in /etc/hosts - so make sure this mapping is correct

I assume you don't have HA active and that the system currently has no quorum? (`ha-manager status` and `pvecm status` would tell you)
If you have no HA - consider restarting corosync
We are already logged in via SSH with the root user, and that password is correct. We have already restarted corosync and pve-cluster, and nothing helped.
 
We are already logged in via SSH with the root user, and that password is correct. We have already restarted corosync and pve-cluster, and nothing helped.
What does the journal print if you try to log in to the GUI with root@pam? - please post it as text
 
I don't know how I should post that to you as text, sorry. The login is not working, as you can see in the screenshot:
Let `journalctl -f` run in an SSH session (there you can copy the text...),
then try to log in ... and paste what gets written to the journal (in your SSH session...)
 
Hello Stoiko, we are troubleshooting hard on this issue. Now we tried to force local mode with pmxcfs -l, and the login was fine even with AD credentials. We have a cluster with 16 nodes. Could you please tell me what the fastest way to rebuild the cluster would be? Maybe deleting it and recreating it from scratch might be a solution? Or is there another, less invasive way? The VMs should be kept up and running if possible.
 
Please post the logs as asked multiple times - without information on where the issue in your network stack is, it is impossible to guess what the proper way forward would be...

the output of `pvecm status` would also help
 
root@vmhost01:/etc/pve# pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused


and journalctl -f


Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 corosync[2773523]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Oct 17 16:11:55 vmhost01 pveproxy[2742981]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:11:55 vmhost01 pveproxy[2742981]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:11:55 vmhost01 pveproxy[2742981]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:11:58 vmhost01 pve-ha-lrm[2716]: updating service status from manager failed: Connection refused
 
and journalctl -xe

root@vmhost01:/etc/pve# journalctl -xe
░░ Support: https://www.debian.org/support
░░
░░ The unit pve-cluster.service has entered the 'failed' state with result 'exit-code'.
Oct 17 16:12:56 vmhost01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
░░ Subject: A start job for unit pve-cluster.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit pve-cluster.service has finished with a failure.
░░
░░ The job identifier is 164061657 and the job result is failed.
Oct 17 16:12:56 vmhost01 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Automatic restarting of the unit pve-cluster.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Oct 17 16:12:56 vmhost01 systemd[1]: Stopped The Proxmox VE cluster filesystem.
░░ Subject: A stop job for unit pve-cluster.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit pve-cluster.service has finished.
░░
░░ The job identifier is 164061748 and the job result is done.
Oct 17 16:12:56 vmhost01 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 17 16:12:56 vmhost01 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit pve-cluster.service has entered the 'failed' state with result 'exit-code'.
Oct 17 16:12:56 vmhost01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
░░ Subject: A start job for unit pve-cluster.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit pve-cluster.service has finished with a failure.
░░
░░ The job identifier is 164061748 and the job result is failed.
Oct 17 16:12:57 vmhost01 pveproxy[2742981]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:12:57 vmhost01 pveproxy[2742981]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:12:57 vmhost01 pveproxy[2742981]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:12:58 vmhost01 pve-ha-lrm[2716]: updating service status from manager failed: Connection refused
Oct 17 16:12:58 vmhost01 corosync[2773523]: [KNET ] link: host: 14 link: 0 is down
Oct 17 16:12:59 vmhost01 pveproxy[2742981]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:12:59 vmhost01 pveproxy[2742981]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:12:59 vmhost01 pveproxy[2742981]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:13:00 vmhost01 pvescheduler[2775606]: replication: Connection refused
Oct 17 16:13:00 vmhost01 pvescheduler[2775607]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:13:00 vmhost01 pveproxy[2742981]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:13:01 vmhost01 cron[2655]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Oct 17 16:13:02 vmhost01 corosync[2773523]: [KNET ] link: host: 11 link: 0 is down
Oct 17 16:13:02 vmhost01 pve-firewall[2673]: status update error: Connection refused
Oct 17 16:13:03 vmhost01 pve-ha-lrm[2716]: updating service status from manager failed: Connection refused
Oct 17 16:13:04 vmhost01 corosync[2773523]: [KNET ] rx: host: 14 link: 0 is up
Oct 17 16:13:05 vmhost01 pvestatd[2743197]: ipcc_send_rec[1] failed: Connection refused
Oct 17 16:13:05 vmhost01 pvestatd[2743197]: ipcc_send_rec[2] failed: Connection refused
Oct 17 16:13:05 vmhost01 pvestatd[2743197]: ipcc_send_rec[3] failed: Connection refused
Oct 17 16:13:05 vmhost01 pvestatd[2743197]: ipcc_send_rec[4] failed: Connection refused
Oct 17 16:13:05 vmhost01 pvestatd[2743197]: status update error: Connection refused
 
root@vmhost01:/etc/pve# pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
Hm:
Does the node still have enough free disk space?
https://forum.proxmox.com/threads/cluster-broken-after-add-node-failed.91464/post-399189

* Else - do all nodes in the cluster have the same error messages?
* Can the nodes ping each other on the interfaces configured in /etc/corosync/corosync.conf?
* Do all nodes (and switches and other equipment in between) agree on the MTU of the network? (see the quick checks after this list)
* What does the corosync.conf look like...
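A rough set of checks for the points above (the IP is just the masked ring0_addr of vmhost01 from your corosync.conf - substitute the peers you want to test):

df -h / /var/lib/pve-cluster              # enough free space for the root filesystem and the cluster database?
ping -c 3 10.100.xx.151                   # basic reachability over the corosync network
ping -c 3 -M do -s 1472 10.100.xx.151     # 1472 + 28 bytes of headers = 1500; increase the size accordingly for jumbo frames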
 
And the pvecm status output from another host:
root@vmhost03:~# pvecm status
Cluster information
-------------------
Name: XXXXXXX
Config Version: 39
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Oct 17 16:15:50 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000010
Ring ID: 10.5e76
Quorate: No

Votequorum information
----------------------
Expected votes: 17
Highest expected: 17
Total votes: 1
Quorum: 9 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000010 1 10.100.XX.XXX (local)
 
The hosts can ping each other - no network issue at this time. Only 2 hosts have the error messages, but we could not even log on to the GUI of the other hosts.

corosync.conf:

root@vmhost01:/etc/corosync# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmProxmox-22
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.100.xx.155
  }
  node {
    name: vmhost01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.xx.151
  }
  node {
    name: vmhost02
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.100.xx.152
  }
  node {
    name: vmhost03
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.100.xx.153
  }
  node {
    name: vmhost12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.100.xx.162
  }
  node {
    name: vmhost13
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.100.xx.163
  }
  node {
    name: vmhost18
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.100.xx.168
  }
  node {
    name: vmhost19
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.xx.169
  }
  node {
    name: vmhost21
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.100.xx.171
  }
  node {
    name: vmhost22
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.100.xx.172
  }
  node {
    name: vmhost23
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.100.xx.173
  }
  node {
    name: vmhost24
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.xx.174
  }
  node {
    name: vmhost25
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.100.xx.175
  }
  node {
    name: vmhost31
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.100.xx.181
  }
  node {
    name: vmhost52
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.100.xx.32
  }
  node {
    name: vmhost53
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.100.xx.33
  }
  node {
    name: vmproxmox-04
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.100.xx.154
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: XXXXXX
  config_version: 39
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
 
On vmhost01 we tried:
systemctl stop pve-cluster corosync
pmxcfs -l
rm -rf /etc/corosync/*
rm /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster

Then we copied back the corosync.conf and the /etc/corosync directory. After this, vmhost01 does not find its way back to being able to start the pve-cluster service.
 
On vmhost01 we tried:
systemctl stop pve-cluster corosync
pmxcfs -l
rm -rf /etc/corosync/*
rm /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster
The removal of /etc/corosync was probably the root of the issues ...
Try copying back /etc/corosync from a node that does not have the issue - and restart corosync and pve-cluster
(make sure that everything in the corosync config seems ok and hasn't changed)

EDIT: also copy the latest corosync.conf to /etc/pve/corosync.conf on the node where you deleted it, before restarting the services!
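In rough terms, something like this (a sketch only - vmhost03 is just an example of a node that still looks intact; double-check the copied config before restarting anything):

systemctl stop pve-cluster corosync
scp -r "root@vmhost03:/etc/corosync/*" /etc/corosync/   # restore the corosync config from a healthy node
pmxcfs -l                                               # start the cluster filesystem in local mode
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf   # put back the file you deleted from /etc/pve
killall pmxcfs
systemctl start corosync pve-cluster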


From the nodes that do not have issues - what's the output of `pvecm status`?
What's in the journal on a node without issues?
 
