Web-based configuration gone?

gijsbert

We have had some issues with the cluster. Right now the cluster seems to be up and running again and all VMs are online. With pvecm status I see that all nodes are OK:

===
pvecm status
Quorum information
------------------
Date: Fri May 12 20:35:01 2017
Quorum provider: corosync_votequorum
Nodes: 15
Node ID: 0x00000001
Ring ID: 12/6492
Quorate: Yes

Votequorum information
----------------------
Expected votes: 15
Highest expected: 15
Total votes: 15
Quorum: 8
Flags: Quorate
===
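
As a side note on the numbers above: corosync's votequorum needs a strict majority, so for 15 expected votes the quorum of 8 is exactly floor(15/2)+1:

```shell
# Quorum is a strict majority of the expected votes: floor(votes/2) + 1.
expected=15
quorum=$(( expected / 2 + 1 ))
echo "$quorum"    # prints 8
```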

But when I log in to the web interface, all nodes show an "unaccessible" red cross, and instead of the server names I only see the VM IDs. Does anyone know what to check or how to fix this issue?

Gijsbert
 
pveproxy seems to take a long time, and the kernel logs a hung-task error:

[822264.538357] INFO: task pveproxy:15176 blocked for more than 120 seconds.
[822264.538399] Tainted: G O 4.4.49-1-pve #1
[822264.538429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[822264.538480] pveproxy D ffff880ffc4b3df8 0 15176 1 0x00000004
[822264.538487] ffff880ffc4b3df8 ffff88105bfbe600 ffff88105c20f000 ffff880858598000
[822264.538490] ffff880ffc4b4000 ffff88085b71f3ac ffff880858598000 00000000ffffffff
[822264.538493] ffff88085b71f3b0 ffff880ffc4b3e10 ffffffff8185c215 ffff88085b71f3a8
[822264.538496] Call Trace:
[822264.538510] [<ffffffff8185c215>] schedule+0x35/0x80
[822264.538513] [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[822264.538516] [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[822264.538519] [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[822264.538524] [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[822264.538526] [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[822264.538530] [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
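
The trace shows a mkdir syscall stuck on a mutex, which in my experience usually means the pmxcfs FUSE mount at /etc/pve has stopped answering. A quick way to test that without freezing your shell (a sketch; /etc/pve is the standard pmxcfs mount point on PVE, and `timeout` comes from coreutils):

```shell
# If pmxcfs is wedged, even a simple listing of /etc/pve blocks forever;
# wrapping it in 'timeout' turns the hang into a clear failure.
if timeout 5 ls /etc/pve >/dev/null 2>&1; then
    echo "/etc/pve is responsive"
else
    echo "/etc/pve is hung (or not mounted)"
fi
```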

systemctl status pvestatd looks OK on all nodes except one; that node reports timeouts:

May 12 21:06:26 <hostname> pvestatd[24622]: got timeout
May 12 21:07:26 <hostname> pvestatd[24622]: got timeout
May 12 21:10:16 <hostname> pvestatd[24622]: status update time (11.431 se...)
May 12 21:11:24 <hostname> pvestatd[24622]: status update time (8.001 sec...)
May 12 21:11:38 <hostname> pvestatd[24622]: got timeout
May 12 21:16:39 <hostname> pvestatd[24622]: got timeout
May 12 21:16:48 <hostname> pvestatd[24622]: got timeout
May 12 21:17:08 <hostname> pvestatd[24622]: got timeout
May 12 21:17:18 <hostname> pvestatd[24622]: got timeout
May 12 21:21:18 <hostname> pvestatd[24622]: got timeout

While all other nodes give something like:

May 12 19:01:48 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:40 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:31:50 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:00 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:11 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:20 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 19:32:30 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
May 12 20:03:08 <hostname> pvestatd[30440]: storage 'VM-backups-backup13'...e
 
It looks like the nodes are not responsive at all anymore. On one node I do a service pveproxy restart, but it hangs. If I check the status now, this is the output:

service pveproxy status
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Fri 2017-05-12 21:29:55 CEST; 25min ago
Process: 26783 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 26789 (code=exited, status=0/SUCCESS)

May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26792 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26790 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: worker 26791 finished
May 12 21:23:54 virt011.sitebytes.nl pveproxy[26789]: server stopped
May 12 21:25:24 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-sigterm time....
May 12 21:26:54 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:28:25 virt011.sitebytes.nl systemd[1]: pveproxy.service stop-final-sigter....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: pveproxy.service still around afte....
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Stopped PVE API Proxy Server.
May 12 21:29:55 virt011.sitebytes.nl systemd[1]: Unit pveproxy.service entered fail....
Hint: Some lines were ellipsized, use -l to show in full.

On a second node I try to restart the pvestatd daemon, but it is also unresponsive and hangs.

Any help will be appreciated.

Gijsbert
 
What we did on every node to fix it was:

for s in pveproxy spiceproxy pvestatd pve-cluster; do /etc/init.d/$s stop; done

Check if any corosync processes are still running and kill them:

ps uxaw | grep corosync
killall -9 corosync

Then restart the cluster:

/etc/init.d/pve-cluster start

If everything is fine, start pvestatd, pveproxy and spiceproxy again.
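
The start step mirrors the stop loop above (a sketch; it assumes the same SysV init scripts and is guarded so it does nothing on machines where they are absent):

```shell
# Bring the dependent daemons back once pve-cluster is up again.
# Guarded so the loop is a no-op on machines without these init scripts.
for s in pvestatd pveproxy spiceproxy; do
    if [ -x /etc/init.d/$s ]; then
        /etc/init.d/$s start
    fi
done
```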


Although everything seems to work now, I still get some errors in syslog.

On 2 (out of 16) nodes we see:

May 13 11:01:51 virt023 pvedaemon[2239]: ipcc_send_rec failed: Transport endpoint is not connected

On all nodes I see:

May 13 10:27:58 virt023 pvestatd[27400]: storage 'VM-backups-backup13' is not online

while if I check with "pvesm status", each node reports the NFS storage. Running "mount" also shows the mount:

172.17.2.7:/data/vm-backups on /mnt/pve/VM-backups-backup13 type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.2.7,mountvers=3,mountport=40003,mountproto=udp,local_lock=none,addr=172.17.2.7)
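
One thing worth knowing: as far as I can tell, pvestatd runs its own liveness probe against the NFS server rather than trusting the existing kernel mount, so a stale-but-mounted export can still be flagged as offline. A manual probe along the same lines (a sketch; the server IP is taken from the mount line above, and rpcinfo/showmount come from the rpcbind and nfs-common packages):

```shell
server=172.17.2.7
# Ask the server's portmapper which RPC services it still advertises...
timeout 5 rpcinfo -p "$server" || echo "rpcinfo probe to $server failed"
# ...and whether it still answers export-list queries.
timeout 5 showmount -e "$server" || echo "showmount probe to $server failed"
```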

Why does syslog report "Transport endpoint is not connected" and pvestatd "storage is not online"?
 
