[SOLVED] Master node in cluster can't restart pvedaemon or pveproxy

henryd99
Dec 13, 2023
Hi,
I noticed I could not access the GUI on my master node, but could from any of the other nodes in the cluster. I got a PVE ticket error when trying to access it.
I updated NTP, as the nodes were not in sync; now they are.
Trying to restart pvedaemon and pveproxy, they simply hang. I can see a load of processes that are still there even after killing them.
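The restart attempts were along these lines (the exact invocations may have varied slightly):

systemctl restart pvedaemon.service
systemctl restart pveproxy.service
# neither command returns; the units just sit in "activating"/"deactivating"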

Cluster health >
Cluster information
-------------------
Name: main
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Dec 13 15:55:02 2023
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.4f4
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.100.100.10 (local)
0x00000002 1 10.100.100.130
0x00000003 1 10.100.50.10

pvedaemon status and PID >

● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
Active: activating (start) since Wed 2023-12-13 15:56:29 GMT; 29s ago
Cntrl PID: 105955 (pvedaemon)
Tasks: 15 (limit: 154123)
Memory: 2.0G
CPU: 458ms
CGroup: /system.slice/pvedaemon.service
├─ 104497 /usr/bin/perl -T /usr/bin/pvedaemon stop
├─ 104632 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 104752 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 104859 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105005 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105139 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105268 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105413 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105547 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105657 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105820 /usr/bin/perl -T /usr/bin/pvedaemon start
├─ 105955 /usr/bin/perl -T /usr/bin/pvedaemon start
├─3795759 "pvedaemon worker"
├─3841929 "pvedaemon worker"
└─3932612 "pvedaemon worker"

Dec 13 15:56:29 us-prox1 systemd[1]: Starting pvedaemon.service - PVE API Daemon...

root 104497 1 0 14:33 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon stop
root 104632 1 0 14:41 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 104752 1 0 14:48 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 104859 1 0 14:56 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105005 1 0 15:03 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105139 1 0 15:11 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105268 1 0 15:18 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105413 1 0 15:26 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105547 1 0 15:33 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105657 1 0 15:41 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105820 1 0 15:48 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105955 1 0 15:56 ? 00:00:00 /usr/bin/perl -T /usr/bin/pvedaemon start
root 105976 105842 0 15:57 pts/4 00:00:00 grep pvedaemon
root 3795759 1 0 Oct18 ? 00:03:01 pvedaemon worker
root 3841929 1 0 Oct18 ? 00:02:59 pvedaemon worker
root 3932612 1 0 Oct18 ? 00:03:18 pvedaemon worker

pveproxy status and PID >

root 104498 1 0 14:33 ? 00:00:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 105988 105842 0 15:58 pts/4 00:00:00 grep pveproxy
root 4131038 1 0 Dec06 ? 00:00:00 /usr/bin/perl -T /usr/bin/pveproxy restart

● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
Active: deactivating (stop-sigterm) (Result: timeout) since Wed 2023-12-13 14:33:45 GMT; 1h 24min ago
Cntrl PID: 105954 (pvecm)
Tasks: 12 (limit: 154123)
Memory: 555.9M
CPU: 289ms
CGroup: /system.slice/pveproxy.service
├─ 104498 /usr/bin/perl -T /usr/bin/pveproxy stop
├─ 104860 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105006 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105140 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105269 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105414 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105548 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105658 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105821 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105954 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 105956 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
└─4131038 /usr/bin/perl -T /usr/bin/pveproxy restart

Dec 13 15:56:29 us-prox1 systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
Dec 13 15:56:59 us-prox1 pvecm[105954]: got timeout
Dec 13 15:57:59 us-prox1 systemd[1]: pveproxy.service: start-pre operation timed out. Terminating.


I'm able to access and SSH into the VMs this node holds, so those are still running.
Any help is greatly appreciated.
Thank you
 
Hi,
can you please post the output of ps auxwf in code tags?
I updated NTP, as the nodes were not in sync; now they are.
What command did you execute exactly? Is there anything in the systemd journal around the time the system time was changed? What is the output of systemctl status chrony.service on that node?
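For instance (redirecting the ps output to a file in case it is too long to paste; the file name is just an example):

ps auxwf > ps-auxwf.txt
systemctl status chrony.service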

Edit: Also note, there is no master node in Proxmox VE, all nodes are equal.
 
Hi,
I updated the server references in the chrony config and restarted the service. I could then see all three nodes were synced to the same source and showed the same time.
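For context, the change amounted to something like the following (the pool entries are placeholders, not the actual servers used):

# /etc/chrony/chrony.conf
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst

systemctl restart chrony.service
chronyc sources -v    # confirm every node selects the same source

The chronyd status on this node now shows: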
● chrony.service - chrony, an NTP client/server
Loaded: loaded (/lib/systemd/system/chrony.service; enabled; preset: enabled)
Active: active (running) since Wed 2023-12-13 12:31:31 GMT; 3h 36min ago
Docs: man:chronyd(8)
man:chronyc(1)
man:chrony.conf(5)
Process: 102667 ExecStart=/usr/sbin/chronyd $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 102671 (chronyd)
Tasks: 2 (limit: 154123)
Memory: 2.9M
CPU: 117ms
CGroup: /system.slice/chrony.service
├─102671 /usr/sbin/chronyd -F 1
└─102672 /usr/sbin/chronyd -F 1

Dec 13 12:31:31 us-prox1 systemd[1]: Starting chrony.service - chrony, an NTP client/server...
Dec 13 12:31:31 us-prox1 chronyd[102671]: chronyd version 4.3 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 -DEBUG)
Dec 13 12:31:31 us-prox1 chronyd[102671]: Frequency -34.495 +/- 0.278 ppm read from /var/lib/chrony/chrony.drift
Dec 13 12:31:31 us-prox1 chronyd[102671]: Using right/UTC timezone to obtain leap second data
Dec 13 12:31:31 us-prox1 chronyd[102671]: Loaded seccomp filter (level 1)
Dec 13 12:31:31 us-prox1 systemd[1]: Started chrony.service - chrony, an NTP client/server.
Dec 13 12:31:36 us-prox1 chronyd[102671]: Selected source XX.XX.XX.XX
Dec 13 12:31:36 us-prox1 chronyd[102671]: System clock TAI offset set to 37 seconds


The ps auxwf output is attached; it's too long for the post.
Thank you for the fast response.
 


Almost all of the Proxmox VE related services are stuck in an uninterruptible sleep state, so you will have to reboot this node to recover. Unfortunately there is no way around that.

Please also check the systemd journal; I would be interested to see whether any errors are listed that might tell us why the services got stuck in this state. Perhaps corosync did not like the time change?
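As an aside, one way to spot processes stuck in that state (purely illustrative, not needed for the recovery itself) is to filter for state "D":

ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'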
 
Ok, I understand.
I'm not able to gracefully shut down the VMs on this node, as everything just hangs.
What would be the safest way to reboot? Any commands to run before "reboot"?
I'm worried about data corruption.
 
I only see one running VM in your outputs. What command did you execute to shut down the VM? Did you try to shut it down from within the VM? You can try to either create a backup and a snapshot of the VM, or migrate the VM to another node. If that does not work, you can try to create a backup from within the VM.
 
Ok, I currently have a Proxmox Backup Server that takes daily snapshots. The only thing is I can't access the GUI, as pveproxy is not working. My node shows a "?" on the VMs and storages, so I can't "interact" with anything at the moment. I guess I'll just have to reboot and verify my data integrity. Thank you for the fast responses.
 
You can also perform these tasks via the CLI, e.g. qm migrate <vmid> <target>, qm snapshot <vmid> <snapname>, or, for a backup, vzdump <vmid>. Although I'm unsure how many of these will work in your system's current state.
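For example (the VM ID, target node and storage name below are made up, adjust to your setup):

qm migrate 100 us-prox2 --online                  # live-migrate VM 100 to another node
qm snapshot 100 pre-reboot                        # snapshot named "pre-reboot"
vzdump 100 --mode snapshot --storage pbs-store    # back up VM 100 to a configured storage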

Edit: added the missing required parameters to the commands.
 
Could you nevertheless share your systemd journal output, in order to understand what might have happened in this particular case? You can generate a dump via journalctl --since <DATETIME> --until <DATETIME> > journal.txt from around the time you ran the commands leading to the services becoming blocked. Thx
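For instance, with the timestamps from this thread (adjust the window as needed):

journalctl --since "2023-12-13 12:00" --until "2023-12-13 16:30" > journal.txt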

Edit: added redirecting the dump to a file.
 
Almost all of the Proxmox VE related services are stuck in an uninterruptible sleep state, so you will have to reboot this node to recover. Unfortunately there is no way around that.

Is there any known situation in which this would be a desirable state for the PVE services to end up in?
 
I am sorry, but I don't really understand the intention of your question. To answer: no, there is no known situation where it would be a "desirable" outcome for the services to end up in such a state.
 
It was more for the OP (he posted this in another thread too*) not to just conclude that "the reboot fixed everything", and to check the logs (as you had requested), because it should never have ended up in such a state (or so I thought). Either it's a bug worth digging out, or something out of the ordinary led to this; either way it would be helpful for anyone else reading this thread later to know about it.

* https://forum.proxmox.com/threads/pveproxy-stuck.102607/#post-615519
 
Agreed, it would be great to find out why this state was reached. I'm not sure whether the NTP config changes directly influenced this, or whether the state was already degraded before. The logs will hopefully tell the story.
 