Web-based configuration gone?

gijsbert

We have had some issues with the cluster. Right now the cluster seems to be up and running again and all VMs are online. With pvecm status I see that all nodes are OK:

===
pvecm status
Quorum information
------------------
Date: Fri May 12 20:35:01 2017
Quorum provider: corosync_votequorum
Nodes: 15
Node ID: 0x00000001
Ring ID: 12/6492
Quorate: Yes

Votequorum information
----------------------
Expected votes: 15
Highest expected: 15
Total votes: 15
Quorum: 8
Flags: Quorate
===

But when I log in to the web interface, all nodes show an inaccessible "red cross", and instead of the server names I only see the VM IDs. Does anyone know what to check or how to fix this issue?
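
My plan is to go over the PVE services that I assume are behind the web interface on each node, something like:

# status of the cluster filesystem and the daemons behind the web interface
systemctl status pve-cluster pvestatd pveproxy pvedaemon

# last log entries from the cluster filesystem and the status daemon
journalctl -u pve-cluster -u pvestatd -n 50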

Gijsbert
 
pveproxy seems to take a long time, and the kernel log shows a hung task error:

[822264.538357] INFO: task pveproxy:15176 blocked for more than 120 seconds.
[822264.538399] Tainted: G O 4.4.49-1-pve #1
[822264.538429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[822264.538480] pveproxy D ffff880ffc4b3df8 0 15176 1 0x00000004
[822264.538487] ffff880ffc4b3df8 ffff88105bfbe600 ffff88105c20f000 ffff880858598000
[822264.538490] ffff880ffc4b4000 ffff88085b71f3ac ffff880858598000 00000000ffffffff
[822264.538493] ffff88085b71f3b0 ffff880ffc4b3e10 ffffffff8185c215 ffff88085b71f3a8
[822264.538496] Call Trace:
[822264.538510] [<ffffffff8185c215>] schedule+0x35/0x80
[822264.538513] [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[822264.538516] [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[822264.538519] [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[822264.538524] [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[822264.538526] [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[822264.538530] [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
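
The trace suggests pveproxy is stuck in a mkdir() waiting on a mutex (I assume somewhere under /etc/pve, but that is a guess). To see which processes are hung in uninterruptible sleep, something like this should work:

# processes stuck in uninterruptible sleep (state D), like the pveproxy above
ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

# any further hung-task warnings from the kernel
dmesg | grep "blocked for more than"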

systemctl status pvestatd looks OK on all nodes except one; that node reports timeouts:

May 12 21:06:26 <hostname> pvestatd[24622]: got timeout
May 12 21:07:26 <hostname> pvestatd[24622]: got timeout
May 12 21:10:16 <hostname> pvestatd[24622]: status update time (11.431 se...)
May 12 21:11:24 <hostname> pvestatd[24622]: status update time (8.001 sec...)
May 12 21:11:38 <hostname> pvestatd[24622]: got timeout
May 12 21:16:39 <hostname> pvestatd[24622]: got timeout
May 12 21:16:48 <hostname> pvestatd[24622]: got timeout
May 12 21:17:08 <hostname> pvestatd[24622]: got timeout
May 12 21:17:18 <hostname> pvestatd[24622]: got timeout
May 12 21:21:18 <hostname> pvestatd[24622]: got timeout

While all other nodes give something like:

May 12 19:01:48 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:40 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:50 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:00 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:11 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 20:03:08 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
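
(For reference, I went over the nodes with something like the loop below; the hostnames are just placeholders for our node names.)

# hostnames are placeholders, adjust to your own nodes
for n in node01 node02 node03; do
    echo "== $n =="
    ssh root@$n 'systemctl --no-pager status pvestatd | tail -n 5'
done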
 
It looks like the nodes are not responsive anymore. On one node I ran a service pveproxy restart, but it hangs. If I check the status now, this is the output:

service pveproxy status
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Fri 2017-05-12 21:29:55 CEST; 25min ago
Process: 26783 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 26789 (code=exited, status=0/SUCCESS)

May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26792 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26790 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26791 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: server stopped
May 12 21:25:24 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-sigterm time....
May 12 21:26:54 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:28:25 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-final-sigter....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Stopped PVE API Proxy Server.
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Unit pveproxy.service entered fail....
Hint: Some lines were ellipsized, use -l to show in full.

On a second node I tried to restart the pvestatd daemon, but it is also unresponsive and hangs.
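
One more thing I want to check is whether the cluster filesystem (pmxcfs, mounted on /etc/pve) is still answering at all, since pveproxy and pvestatd both read from it. A quick test (note that the ls will hang as well if pmxcfs is stuck, so run it in a spare shell):

# is the cluster filesystem daemon still alive?
systemctl status pve-cluster
pgrep -a pmxcfs

# this will hang too if /etc/pve is stuck
ls -l /etc/pve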

Any help will be appreciated.

Gijsbert
 
What we did on every node to fix it was:

for s in pveproxy spiceproxy pvestatd pve-cluster; do /etc/init.d/$s stop; done

Check if any corosync processes are running and kill them

ps uxaw | grep corosync
killall -9 corosync

Then restart the cluster

/etc/init.d/pve-cluster start

If everything is fine, start pvestatd, pveproxy and spiceproxy again, as sketched below.
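Roughly like this, mirroring the stop loop above (same init scripts; adjust if you prefer systemctl):

# start the remaining daemons once pve-cluster is up again
for s in pvestatd pveproxy spiceproxy; do /etc/init.d/$s start; done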


Although everything seems to work now, I still get some errors in syslog.

On 2 (out of 16) nodes we see:

May 13 11:01:51 virt023 pvedaemon[2239]: ipcc_send_rec failed: Transport endpoint is not connected
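
If I understand it correctly, ipcc_send_rec is the daemon's IPC connection to pmxcfs, so my assumption is that pvedaemon on those two nodes still holds the old connection from before we restarted pve-cluster. Restarting it there is what I would try:

# only on the nodes that log the ipcc error
systemctl restart pvedaemon
grep ipcc_send_rec /var/log/syslog | tail -n 5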

On all nodes I see:

May 13 10:27:58 virt023 pvestatd[27400]: storage 'VM-backups-backup13' is not online

while if I check with "pvesm status", each node reports the NFS storage, and "mount" also shows the mount:

172.17.2.7:/data/vm-backups on /mnt/pve/VM-backups-backup13 type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.2.7,mountvers=3,mountport=40003,mountproto=udp,local_lock=none,addr=172.17.2.7)

Why does syslog report "Transport endpoint is not connected", and why does pvestatd report that the storage is not online?
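
To rule out an actual connectivity problem I would test the NFS server directly from a node. As far as I know, the "is not online" check queries the NFS server itself (showmount/rpcinfo) rather than looking at the existing mount, which might explain why the mount still works while pvestatd complains:

# can the node reach the NFS server's mountd/nfs services at all?
showmount -e 172.17.2.7
rpcinfo -p 172.17.2.7 | grep -E 'mountd|nfs'

# does the existing mount still answer?
ls /mnt/pve/VM-backups-backup13 >/dev/null && echo "mount responds"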