Access to Web-GUI problem

gosha

Oct 20, 2014
Russia
Hi!

After solving the problem with unexpected reboots:
https://forum.proxmox.com/threads/unexpected-reboots-help-need.34310/
A new problem has arrived... :)

Preamble:
This cluster stores VM and CT backups from another cluster (which works
without any problems). The backup cluster has three nodes: two physical servers and a third,
virtual node (for quorum and ceph-mon only) that runs in the other cluster.
Two VMs (both Debian 8.x) run in this backup cluster:
  • NFS-server (for linux VMs and CTs backups)
  • SAMBA-server (for windows VMs backups)
Ceph is used for storage. Backup jobs run at night.

pic.png

Problem:
Some time after the backup tasks start, I can't connect to the Web-GUI on either physical node.
The third (virtual) node's Web-GUI is still available, but any operation on the first two nodes
from there fails with a timeout... :( While this is happening, the backup tasks keep running and finish normally.
After the backup tasks finish (in the morning), access to the Web-GUI on the two physical nodes
is not restored. So far the only fix I have is rebooting the two physical nodes.
This never happens during the day, apparently because the cluster is not really loaded then...

How can I find the cause of this abnormal behavior?

Best regards,
Gosha
 
I still can't understand what is happening with the cluster. :(
I see only some errors in the syslog, and I don't understand what they mean.
Example from the syslog:

Apr 30 06:25:16 acn3 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="837" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Apr 30 06:28:17 acn3 corosync[967]: [MAIN ] Corosync main process was not scheduled for 2530.0698 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Apr 30 06:28:17 acn3 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Apr 30 06:28:17 acn3 rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ]
Apr 30 06:28:17 acn3 corosync[967]: [TOTEM ] A processor failed, forming new configuration.
Apr 30 06:28:17 acn3 corosync[967]: [TOTEM ] A new membership (192.168.0.220:7424) was formed. Members
Apr 30 06:28:17 acn3 corosync[967]: [QUORUM] Members[3]: 2 3 1
Apr 30 06:28:17 acn3 corosync[967]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 30 06:28:22 acn3 corosync[967]: [MAIN ] Corosync main process was not scheduled for 3727.6453 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Apr 30 06:28:22 acn3 corosync[967]: [TOTEM ] A processor failed, forming new configuration.
Apr 30 06:28:22 acn3 corosync[967]: [TOTEM ] A new membership (192.168.0.220:7432) was formed. Members joined: 2 1 left: 2 1
Apr 30 06:28:22 acn3 corosync[967]: [TOTEM ] Failed to receive the leave message. failed: 2 1
Apr 30 06:28:22 acn3 corosync[967]: [QUORUM] Members[3]: 2 3 1
Apr 30 06:28:22 acn3 corosync[967]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/acn1: -1
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/acn1: /var/lib/rrdcached/db/pve2-node/acn1: illegal attempt to update using time 1493515367 when last update time is 1493515417 (minimum one second step)
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/201: -1
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/201: /var/lib/rrdcached/db/pve2-vm/201: illegal attempt to update using time 1493515367 when last update time is 1493515697 (minimum one second step)
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/200: -1
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/200: /var/lib/rrdcached/db/pve2-vm/200: illegal attempt to update using time 1493515367 when last update time is 1493515697 (minimum one second step)
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/acn1/local: -1
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/acn1/local: /var/lib/rrdcached/db/pve2-storage/acn1/local: illegal attempt to update using time 1493515367 when last update time is 1493515417 (minimum one second step)
Apr 30 06:28:22 acn3 pmxcfs[942]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/acn1/ceph_stor: -1

Help me, please!
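
The corosync lines in the excerpt above report how long the main process went unscheduled and the threshold it was compared against. A minimal sketch, assuming the message format is exactly as pasted above, that pulls those numbers out of a syslog so the pauses can be lined up against the backup window. The function name `find_scheduling_pauses` and the `SAMPLE` list are illustrative only, not part of any Proxmox or corosync tool.

```python
import re

# Pattern for corosync's scheduling-delay warning, written against the
# two occurrences in the syslog excerpt; other corosync versions may
# word the message differently (assumption).
PAUSE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+"
    r"\[MAIN\s*\]\s+Corosync main process was not scheduled for "
    r"(?P<delay>[\d.]+) ms \(threshold is (?P<threshold>[\d.]+) ms\)"
)


def find_scheduling_pauses(lines):
    """Return one dict per 'not scheduled' warning found in the given lines."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.match(line)
        if match:
            pauses.append({
                "stamp": match.group("stamp"),
                "host": match.group("host"),
                "delay_ms": float(match.group("delay")),
                "threshold_ms": float(match.group("threshold")),
            })
    return pauses


# Lines copied from the excerpt above, used here only as sample input.
SAMPLE = [
    "Apr 30 06:28:17 acn3 corosync[967]: [MAIN ] Corosync main process was not scheduled for 2530.0698 ms (threshold is 1320.0000 ms). Consider token timeout increase.",
    "Apr 30 06:28:17 acn3 corosync[967]: [TOTEM ] A processor failed, forming new configuration.",
    "Apr 30 06:28:22 acn3 corosync[967]: [MAIN ] Corosync main process was not scheduled for 3727.6453 ms (threshold is 1320.0000 ms). Consider token timeout increase.",
]

if __name__ == "__main__":
    for pause in find_scheduling_pauses(SAMPLE):
        over = pause["delay_ms"] - pause["threshold_ms"]
        print(f'{pause["stamp"]} {pause["host"]}: {over:.1f} ms over threshold')
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_scheduling_pauses(open("/var/log/syslog"))`. The log path is an assumption about the node's setup.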
 
Hi!

It looks like I found the cause: it is pveproxy.

Apr 30 10:31:01 acn1 pveproxy[2299]: starting server
Apr 30 10:31:01 acn1 pveproxy[2299]: starting 3 worker(s)
Apr 30 10:31:01 acn1 pveproxy[2299]: worker 2300 started
Apr 30 10:31:01 acn1 pveproxy[2299]: worker 2301 started
Apr 30 10:31:01 acn1 pveproxy[2299]: worker 2302 started
May 1 06:26:34 acn1 systemd[1]: pveproxy.service stopping timed out. Terminating.
May 1 06:26:34 acn1 pveproxy[2299]: received signal TERM
May 1 06:26:34 acn1 pveproxy[2299]: server closing
May 1 06:26:34 acn1 pveproxy[2301]: worker exit
May 1 06:26:34 acn1 pveproxy[2299]: worker 2302 finished
May 1 06:26:34 acn1 pveproxy[2299]: worker 2300 finished
May 1 06:26:34 acn1 pveproxy[2299]: worker 2301 finished
May 1 06:26:34 acn1 pveproxy[2299]: server stopped
May 1 06:28:04 acn1 systemd[1]: pveproxy.service stop-sigterm timed out. Killing.
May 1 06:28:31 acn1 kernel: [71879.314981] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:28:31 acn1 kernel: [71879.315108] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:29:35 acn1 systemd[1]: pveproxy.service still around after SIGKILL. Ignoring.
May 1 06:30:31 acn1 kernel: [71999.314162] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:30:31 acn1 kernel: [71999.314275] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:31:05 acn1 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
May 1 06:32:31 acn1 kernel: [72119.313092] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:32:31 acn1 kernel: [72119.313206] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:32:35 acn1 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
May 1 06:32:35 acn1 systemd[1]: Unit pveproxy.service entered failed state.
May 1 06:34:05 acn1 systemd[1]: pveproxy.service start operation timed out. Terminating.
May 1 06:34:31 acn1 kernel: [72239.311826] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:34:31 acn1 kernel: [72239.311939] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:35:36 acn1 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
May 1 06:36:31 acn1 kernel: [72359.310457] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:36:31 acn1 kernel: [72359.310570] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:36:31 acn1 kernel: [72359.310611] INFO: task pveproxy:39265 blocked for more than 120 seconds.
May 1 06:36:31 acn1 kernel: [72359.310715] pveproxy D ffff8802b5b43df8 0 39265 1 0x00000004
May 1 06:37:06 acn1 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
May 1 06:37:06 acn1 systemd[1]: Unit pveproxy.service entered failed state.
May 1 06:38:31 acn1 kernel: [72479.308941] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:38:31 acn1 kernel: [72479.309053] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:38:31 acn1 kernel: [72479.309094] INFO: task pveproxy:39265 blocked for more than 120 seconds.
May 1 06:38:31 acn1 kernel: [72479.309199] pveproxy D ffff8802b5b43df8 0 39265 1 0x00000004
May 1 06:40:31 acn1 kernel: [72599.307364] INFO: task pveproxy:38131 blocked for more than 120 seconds.
May 1 06:40:31 acn1 kernel: [72599.307477] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004
May 1 06:40:31 acn1 kernel: [72599.307519] INFO: task pveproxy:39265 blocked for more than 120 seconds.
May 1 06:40:31 acn1 kernel: [72599.307624] pveproxy D ffff8802b5b43df8 0 39265 1 0x00000004
May 1 06:46:07 acn1 systemd[1]: pveproxy.service start operation timed out. Terminating.
May 1 06:47:38 acn1 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
May 1 06:49:08 acn1 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
May 1 06:49:08 acn1 systemd[1]: Unit pveproxy.service entered failed state.
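
The kernel lines in this excerpt repeat every two minutes for the same PIDs, which makes it hard to see at a glance how many distinct processes are stuck and since when. A minimal sketch, assuming the "blocked for more than N seconds" message format is exactly as pasted above, that groups those reports by task name and PID. The names `summarize_hung_tasks` and `SAMPLE` are illustrative only.

```python
import re

# Pattern for the kernel's hung-task report, written against the lines in
# the excerpt above (assumption: other kernels may format it differently).
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(lines):
    """Group hung-task reports by (task name, pid); keep count, first and last timestamp."""
    summary = {}
    for line in lines:
        match = HUNG_RE.match(line)
        if not match:
            continue
        key = (match.group("name"), int(match.group("pid")))
        entry = summary.setdefault(
            key, {"count": 0, "first": match.group("stamp"), "last": None}
        )
        entry["count"] += 1
        entry["last"] = match.group("stamp")
    return summary


# Lines copied from the excerpt above, used here only as sample input.
SAMPLE = [
    "May 1 06:28:31 acn1 kernel: [71879.314981] INFO: task pveproxy:38131 blocked for more than 120 seconds.",
    "May 1 06:28:31 acn1 kernel: [71879.315108] pveproxy D ffff8802cb747df8 0 38131 1 0x00000004",
    "May 1 06:30:31 acn1 kernel: [71999.314162] INFO: task pveproxy:38131 blocked for more than 120 seconds.",
    "May 1 06:36:31 acn1 kernel: [72359.310457] INFO: task pveproxy:38131 blocked for more than 120 seconds.",
    "May 1 06:36:31 acn1 kernel: [72359.310611] INFO: task pveproxy:39265 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    for (name, pid), info in summarize_hung_tasks(SAMPLE).items():
        print(f'{name}[{pid}]: {info["count"]} reports, {info["first"]} to {info["last"]}')
```

On the excerpt above, this would show two pveproxy processes (38131 and 39265) that stay blocked, with the second one appearing after systemd's restart attempt.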

But I cannot understand why this is happening. o_O
Help me, please!

Best regards,
Gosha
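
The "D" in the kernel lines above is the uninterruptible-sleep state, which would explain why the pveproxy processes are "still around after SIGKILL". A minimal sketch, assuming a Linux node with a standard `/proc`, that lists processes currently in state D together with the kernel symbol they are waiting in (when `/proc/<pid>/wchan` is readable). It only shows where a process is stuck, not why; the function names are illustrative and not part of any Proxmox tool.

```python
import os


def parse_stat(text):
    """Split the content of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut at the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small /proc file; return None if the process vanished or access is denied."""
    try:
        with open(path, "r", errors="replace") as handle:
            return handle.read().strip()
    except OSError:
        return None


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm, wchan) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue
        pid, comm, state = parse_stat(stat)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in find_uninterruptible():
        print(f"{pid}\t{comm}\twaiting in: {wchan or 'unknown'}")
```

Running this on acn1 while the Web-GUI is unreachable should list the same pveproxy PIDs that the kernel reports as blocked. The wait symbol can hint at which subsystem (for example a filesystem or network mount) the process is waiting on.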