Cluster running, but node shows offline

TechLineX

Active Member
Mar 2, 2015
213
5
38
Bgv9e.png


Since about 12 o'clock, two nodes have been showing as offline, but the machines are still running. Any ideas?

pvecm status shows all 3 nodes.


All VMs seem to be online.
 
I looked in the daemon.log:

Code:
Oct 24 12:19:26 host17 corosync[1679]:  [TOTEM ] A processor failed, forming new configuration.
Oct 24 12:19:28 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62830->[149.202.197.68]:161
Oct 24 12:19:29 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:432) was formed. Members left: 3 2
Oct 24 12:19:29 host17 corosync[1679]:  [TOTEM ] Failed to receive the leave message. failed: 3 2
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656
Oct 24 12:19:29 host17 pmxcfs[1656]: [status] notice: members: 1/1656
Oct 24 12:19:29 host17 corosync[1679]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 24 12:19:29 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:29 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:19:29 host17 pmxcfs[1656]: [status] notice: node lost quorum
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] crit: received write while not quorate - trigger resync
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] crit: leaving CPG group
Oct 24 12:19:29 host17 pve-ha-lrm[1715]: unable to write lrm status file - unable to open file '/etc/pve/nodes/host17/lrm_status.tmp.1715' - Permission denied
Oct 24 12:19:30 host17 pmxcfs[1656]: [dcdb] notice: start cluster connection
Oct 24 12:19:33 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62836->[149.202.197.68]:161
Oct 24 12:19:35 host17 rrdcached[1526]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/host17/Backup17) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/host17/Backup17: illegal attempt to update using time 1477304081 when last update time is 1477304342 (minimum one second step))
Oct 24 12:19:38 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62842->[149.202.197.68]:161
Oct 24 12:19:40 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:468) was formed. Members
Oct 24 12:19:40 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:40 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:19:40 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656
Oct 24 12:19:40 host17 pmxcfs[1656]: [dcdb] notice: all data is up to date
Oct 24 12:19:43 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62849->[149.202.197.68]:161
Oct 24 12:19:46 host17 corosync[1679]:  [MAIN  ] Corosync main process was not scheduled for 3024.7622 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Oct 24 12:19:51 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:492) was formed. Members
Oct 24 12:19:51 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:51 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:20:00 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.12:516) was formed. Members joined: 3 2
Oct 24 12:20:02 host17 corosync[1679]:  [TOTEM ] A processor failed, forming new configuration.
Oct 24 12:20:03 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 10
Oct 24 12:20:04 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 20
Oct 24 12:20:05 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 30
Oct 24 12:20:06 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 40
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 50
Oct 24 12:20:07 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.12:536) was formed. Members
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: starting data syncronisation
Oct 24 12:20:07 host17 corosync[1679]:  [QUORUM] This node is within the primary component and will provide service.
Oct 24 12:20:07 host17 corosync[1679]:  [QUORUM] Members[3]: 3 2 1
Oct 24 12:20:07 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: cpg_send_message retried 56 times
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: cpg_send_message retried 1 times
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: node has quorum
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656, 2/1793, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: members: 1/1656, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: starting data syncronisation
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: members: 1/1656, 2/1793, 3/2634
Oct 24 12:20:08 host17 pmxcfs[1656]: [dcdb] notice: received sync request (epoch 1/1656/00000020)
Oct 24 12:20:08 host17 pmxcfs[1656]: [dcdb] notice: received sync request (epoch 1/1656/00000021)
Oct 24 12:20:08 host17 pmxcfs[1656]: [status] notice: received sync request (epoch 1/1656/0000001C)
Oct 24 12:20:09 host17 pmxcfs[1656]: [status] notice: received sync request (epoch 1/1656/0000001D)
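The 12:19:46 line says "Consider token timeout increase". If I read that correctly, it would mean raising the totem token value in the corosync config (on Proxmox VE that is /etc/pve/corosync.conf, and config_version has to be incremented when editing it). A rough excerpt of what I think that would look like - the 10000 ms is just a guess on my part, not a tested value:

Code:
totem {
  ...
  # token timeout in milliseconds (example value only)
  token: 10000
}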

pvecm status:

Code:
pvecm status
Quorum information
------------------
Date:             Mon Oct 24 18:34:07 2016
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          3/536
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 IPHOST12
0x00000002          1 IPHOST16 (local)
0x00000001          1 IPHOST17

Now I get errors like this:

Code:
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:05 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
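In case it helps, the ring state as corosync itself sees it can be checked on each node with the standard tools (generic commands, nothing specific to this setup):

Code:
# status of the totem ring(s) on this node
corosync-cfgtool -s
# corosync log messages since the last boot
journalctl -u corosync -b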
 
Hi,

the error about the missing home dir is not normal.
I think this is the source of your problems.
Did you change any system files?
 
Hi,
we have been running a Proxmox cluster for nearly a year without trouble (including an upgrade from 3.4 to 4.1).
Last August we successfully upgraded the cluster from 4.1 to 4.2-17.
Since then, within two months, we have experienced four incidents very similar to the one described above.
We have 6 nodes configured and up, all with votes, so apparently the cluster is in good shape.
The web interface shows all nodes red except the one we are logged in to.
Furthermore, there is no way to manage the running VMs from the web interface.
We have used the usual CLI commands
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
but the commands hang.
Today we have a new incident with the same symptoms.
Previously we finally decided to reboot the entire cluster to restore GUI control, but the users are very unhappy about stopping work for 1-2 hours while waiting for an orderly reboot.
We suspect the problem is related to the shared filesystem (pmxcfs), but we have not found the root cause or a resolution method.
Can you suggest any ideas to solve the problem?

GUI: all nodes red except the one logged in to; cannot manage the running VMs.

# pveversion
pve-manager/4.2-17/e1400248 (running kernel: 4.4.15-1-pve)

# pvecm status
Quorum information
------------------
Date: Tue Oct 25 17:06:36 2016
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000005
Ring ID: 1/221952
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.110.51
0x00000002 1 192.168.110.52
0x00000003 1 192.168.110.53
0x00000004 1 192.168.110.54
0x00000005 1 192.168.110.55 (local)
0x00000006 1 192.168.110.56

Paul
 
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
but the commands hang.

I have the same problem, and I think it's corosync. For now I use an ugly workaround via Ansible (note the "pkill -9 corosync"):

Code:
---
- hosts: pve
  sudo: yes

  tasks:

    - name: Stop pve-cluster
      service: name=pve-cluster state=stopped

    - name: kill corosync
      shell: pkill -9 corosync

    - name: Restart pve
      service: name={{ item }} state=restarted
      with_items:
        - pve-cluster
        - pvedaemon
        - pvestatd
        - pveproxy
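Assuming the playbook is saved as pve-restart.yml and the pve group exists in the inventory file (both names are just placeholders), it is run with:

Code:
ansible-playbook -i hosts pve-restart.yml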
 
Is there a method to restart corosync?

And if the cluster thinks the server is offline, how do I make it resync? At the moment I have to restart the whole server.
 
Just to let you know.. I have this problem as well.
https://www.dropbox.com/s/vh3uzo15icrjv4k/Skärmklipp 2016-11-22 01.15.00.png?dl=0

Also, all tasks are hanging and backups are not working...

Code:
Package versions
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-94
pve-firmware: 1.1-10
libpve-common-perl: 4.0-80
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-14
pve-qemu-kvm: 2.7.0-6
pve-container: 1.0-81
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
 
# killall -9 corosync
# systemctl restart pve-cluster
Those commands did not bring the server back online in the GUI, but they did sync all data between the nodes.

I can bring the server back with "service pvestatd restart".

The server stays online for a while, then it goes back to offline status.
 
Confirming the issue. Communication was lost - even Nagios checks against snmpd were unresponsive. pvecm nodes/info showed everything was OK. Restarting corosync resynced, pve-cluster was restarted, there were some "endpoint not connected" messages, so I fully restarted the server.

proxmox-ve 4.3.7-1
pve-manager 4.3-10/7230e60f (running kernel: 4.4.21-1-pve)
corosync-pve 2.4.0-1
 
Hello guys, I have the same problem on Proxmox 5 :( Any solutions to fix this? I have a feeling it will happen again and again, and I don't like constantly wondering when it will happen next.
 
Hi guys, any solutions? This problem used to happen weekly, but now it happens every 2 days.
 
root@Marte:~# killall -9 corosync
root@Marte:~# systemctl restart pve-cluster

works for me on the offline node
 
