Web-based configuration gone?

gijsbert

We have had some issues with the cluster. Right now the cluster seems to be up and running again and all VMs are online. With pvecm status I see that all nodes are OK:

===
pvecm status
Quorum information
------------------
Date: Fri May 12 20:35:01 2017
Quorum provider: corosync_votequorum
Nodes: 15
Node ID: 0x00000001
Ring ID: 12/6492
Quorate: Yes

Votequorum information
----------------------
Expected votes: 15
Highest expected: 15
Total votes: 15
Quorum: 8
Flags: Quorate
===

But when I log in to the web interface, all nodes show an inaccessible "red cross", and instead of the server names I only see the VM IDs. Does anyone know what to check or how to fix this issue?
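
My plan is to go over the PVE services that I assume are behind the web interface on each node, something like:

# status of the cluster filesystem and the daemons behind the web interface
systemctl status pve-cluster pvestatd pveproxy pvedaemon

# last log entries from the cluster filesystem and the status daemon
journalctl -u pve-cluster -u pvestatd -n 50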

Gijsbert
 
pveproxy seems to take a long time, and the kernel log shows a hung task error:

[822264.538357] INFO: task pveproxy:15176 blocked for more than 120 seconds.
[822264.538399] Tainted: G O 4.4.49-1-pve #1
[822264.538429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[822264.538480] pveproxy D ffff880ffc4b3df8 0 15176 1 0x00000004
[822264.538487] ffff880ffc4b3df8 ffff88105bfbe600 ffff88105c20f000 ffff880858598000
[822264.538490] ffff880ffc4b4000 ffff88085b71f3ac ffff880858598000 00000000ffffffff
[822264.538493] ffff88085b71f3b0 ffff880ffc4b3e10 ffffffff8185c215 ffff88085b71f3a8
[822264.538496] Call Trace:
[822264.538510] [<ffffffff8185c215>] schedule+0x35/0x80
[822264.538513] [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[822264.538516] [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[822264.538519] [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[822264.538524] [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[822264.538526] [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[822264.538530] [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
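
The trace suggests pveproxy is stuck in a mkdir() waiting on a mutex (I assume somewhere under /etc/pve, but that is a guess). To see which processes are hung in uninterruptible sleep, something like this should work:

# processes stuck in uninterruptible sleep (state D), like the pveproxy above
ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

# any further hung-task warnings from the kernel
dmesg | grep "blocked for more than"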

systemctl status pvestatd looks OK on all nodes except one; that node reports timeouts:

May 12 21:06:26 <hostname> pvestatd[24622]: got timeout
May 12 21:07:26 <hostname> pvestatd[24622]: got timeout
May 12 21:10:16 <hostname> pvestatd[24622]: status update time (11.431 se...)
May 12 21:11:24 <hostname> pvestatd[24622]: status update time (8.001 sec...)
May 12 21:11:38 <hostname> pvestatd[24622]: got timeout
May 12 21:16:39 <hostname> pvestatd[24622]: got timeout
May 12 21:16:48 <hostname> pvestatd[24622]: got timeout
May 12 21:17:08 <hostname> pvestatd[24622]: got timeout
May 12 21:17:18 <hostname> pvestatd[24622]: got timeout
May 12 21:21:18 <hostname> pvestatd[24622]: got timeout

While all other nodes give something like:

May 12 19:01:48 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:40 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:50 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:00 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:11 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 20:03:08 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
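
(For reference, I went over the nodes with something like the loop below; the hostnames are just placeholders for our node names.)

# hostnames are placeholders, adjust to your own nodes
for n in node01 node02 node03; do
    echo "== $n =="
    ssh root@$n 'systemctl --no-pager status pvestatd | tail -n 5'
done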
 
It looks like the nodes are not responsive anymore. On one node I ran a service pveproxy restart, but it hangs. If I check the status now, this is the output:

service pveproxy status
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Fri 2017-05-12 21:29:55 CEST; 25min ago
Process: 26783 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 26789 (code=exited, status=0/SUCCESS)

May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26792 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26790 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26791 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: server stopped
May 12 21:25:24 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-sigterm time....
May 12 21:26:54 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:28:25 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-final-sigter....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Stopped PVE API Proxy Server.
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Unit pveproxy.service entered fail....
Hint: Some lines were ellipsized, use -l to show in full.

On a second node I tried to restart the pvestatd daemon, but it is also unresponsive and hangs.
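
One more thing I want to check is whether the cluster filesystem (pmxcfs, mounted on /etc/pve) is still answering at all, since pveproxy and pvestatd both read from it. A quick test (note that the ls will hang as well if pmxcfs is stuck, so run it in a spare shell):

# is the cluster filesystem daemon still alive?
systemctl status pve-cluster
pgrep -a pmxcfs

# this will hang too if /etc/pve is stuck
ls -l /etc/pve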

Any help will be appreciated.

Gijsbert
 
What we did on every node to fix it was:

for s in pveproxy spiceproxy pvestatd pve-cluster; do /etc/init.d/$s stop; done

Check if any corosync processes are running and kill them

ps uxaw | grep corosync
killall -9 corosync

Then restart the cluster

/etc/init.d/pve-cluster start

If everything is fine, start pvestatd, pveproxy and spiceproxy again, as sketched below.
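Roughly like this, mirroring the stop loop above (same init scripts; adjust if you prefer systemctl):

# start the remaining daemons once pve-cluster is up again
for s in pvestatd pveproxy spiceproxy; do /etc/init.d/$s start; done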


Although everything seems to work now, I still get some errors in syslog.

On 2 (out of 16) nodes we see:

May 13 11:01:51 virt023 pvedaemon[2239]: ipcc_send_rec failed: Transport endpoint is not connected
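
If I understand it correctly, ipcc_send_rec is the daemon's IPC connection to pmxcfs, so my assumption is that pvedaemon on those two nodes still holds the old connection from before we restarted pve-cluster. Restarting it there is what I would try:

# only on the nodes that log the ipcc error
systemctl restart pvedaemon
grep ipcc_send_rec /var/log/syslog | tail -n 5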

On all nodes I see:

May 13 10:27:58 virt023 pvestatd[27400]: storage 'VM-backups-backup13' is not online

while if I check with "pvesm status", each node reports the NFS storage, and "mount" also shows the mount:

172.17.2.7:/data/vm-backups on /mnt/pve/VM-backups-backup13 type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.2.7,mountvers=3,mountport=40003,mountproto=udp,local_lock=none,addr=172.17.2.7)

Why does syslog report "Transport endpoint is not connected", and why does pvestatd report that the storage is not online?
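
To rule out an actual connectivity problem I would test the NFS server directly from a node. As far as I know, the "is not online" check queries the NFS server itself (showmount/rpcinfo) rather than looking at the existing mount, which might explain why the mount still works while pvestatd complains:

# can the node reach the NFS server's mountd/nfs services at all?
showmount -e 172.17.2.7
rpcinfo -p 172.17.2.7 | grep -E 'mountd|nfs'

# does the existing mount still answer?
ls /mnt/pve/VM-backups-backup13 >/dev/null && echo "mount responds"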