Cluster running, but node shows offline

TechLineX

Active Member
Mar 2, 2015
213
5
38
Bgv9e.png


Since about 12 o'clock, two nodes have been showing as offline, but the machines are still running. Any ideas?

pvecm status shows all 3 nodes.


All VMs seem to be online.
 
I looked in the daemon.log:

Code:
Oct 24 12:19:26 host17 corosync[1679]:  [TOTEM ] A processor failed, forming new configuration.
Oct 24 12:19:28 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62830->[149.202.197.68]:161
Oct 24 12:19:29 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:432) was formed. Members left: 3 2
Oct 24 12:19:29 host17 corosync[1679]:  [TOTEM ] Failed to receive the leave message. failed: 3 2
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656
Oct 24 12:19:29 host17 pmxcfs[1656]: [status] notice: members: 1/1656
Oct 24 12:19:29 host17 corosync[1679]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 24 12:19:29 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:29 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:19:29 host17 pmxcfs[1656]: [status] notice: node lost quorum
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] crit: received write while not quorate - trigger resync
Oct 24 12:19:29 host17 pmxcfs[1656]: [dcdb] crit: leaving CPG group
Oct 24 12:19:29 host17 pve-ha-lrm[1715]: unable to write lrm status file - unable to open file '/etc/pve/nodes/host17/lrm_status.tmp.1715' - Permission denied
Oct 24 12:19:30 host17 pmxcfs[1656]: [dcdb] notice: start cluster connection
Oct 24 12:19:33 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62836->[149.202.197.68]:161
Oct 24 12:19:35 host17 rrdcached[1526]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/host17/Backup17) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/host17/Backup17: illegal attempt to update using time 1477304081 when last update time is 1477304342 (minimum one second step))
Oct 24 12:19:38 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62842->[149.202.197.68]:161
Oct 24 12:19:40 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:468) was formed. Members
Oct 24 12:19:40 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:40 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:19:40 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656
Oct 24 12:19:40 host17 pmxcfs[1656]: [dcdb] notice: all data is up to date
Oct 24 12:19:43 host17 snmpd[16604]: Connection from UDP: [212.48.109.169]:62849->[149.202.197.68]:161
Oct 24 12:19:46 host17 corosync[1679]:  [MAIN  ] Corosync main process was not scheduled for 3024.7622 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Oct 24 12:19:51 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.17:492) was formed. Members
Oct 24 12:19:51 host17 corosync[1679]:  [QUORUM] Members[1]: 1
Oct 24 12:19:51 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:20:00 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.12:516) was formed. Members joined: 3 2
Oct 24 12:20:02 host17 corosync[1679]:  [TOTEM ] A processor failed, forming new configuration.
Oct 24 12:20:03 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 10
Oct 24 12:20:04 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 20
Oct 24 12:20:05 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 30
Oct 24 12:20:06 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 40
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: cpg_send_message retry 50
Oct 24 12:20:07 host17 corosync[1679]:  [TOTEM ] A new membership (172.16.0.12:536) was formed. Members
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: starting data syncronisation
Oct 24 12:20:07 host17 corosync[1679]:  [QUORUM] This node is within the primary component and will provide service.
Oct 24 12:20:07 host17 corosync[1679]:  [QUORUM] Members[3]: 3 2 1
Oct 24 12:20:07 host17 corosync[1679]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: cpg_send_message retried 56 times
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: cpg_send_message retried 1 times
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: node has quorum
Oct 24 12:20:07 host17 pmxcfs[1656]: [dcdb] notice: members: 1/1656, 2/1793, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: members: 1/1656, 3/2634
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: starting data syncronisation
Oct 24 12:20:07 host17 pmxcfs[1656]: [status] notice: members: 1/1656, 2/1793, 3/2634
Oct 24 12:20:08 host17 pmxcfs[1656]: [dcdb] notice: received sync request (epoch 1/1656/00000020)
Oct 24 12:20:08 host17 pmxcfs[1656]: [dcdb] notice: received sync request (epoch 1/1656/00000021)
Oct 24 12:20:08 host17 pmxcfs[1656]: [status] notice: received sync request (epoch 1/1656/0000001C)
Oct 24 12:20:09 host17 pmxcfs[1656]: [status] notice: received sync request (epoch 1/1656/0000001D)
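The 12:19:46 line says "Consider token timeout increase". If I read that correctly, it would mean raising the totem token value in the corosync config (on Proxmox VE that is /etc/pve/corosync.conf, and config_version has to be incremented when editing it). A rough excerpt of what I think that would look like - the 10000 ms is just a guess on my part, not a tested value:

Code:
totem {
  ...
  # token timeout in milliseconds (example value only)
  token: 10000
}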

pvecm status:

Code:
pvecm status
Quorum information
------------------
Date:             Mon Oct 24 18:34:07 2016
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          3/536
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 IPHOST12
0x00000002          1 IPHOST16 (local)
0x00000001          1 IPHOST17

Now I get errors like this:

Code:
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:04 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
Oct 24 18:21:05 host17 corosync[1679]:  [CPG   ] Unknown node -> we will not deliver message
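In case it helps, the ring state as corosync itself sees it can be checked on each node with the standard tools (generic commands, nothing specific to this setup):

Code:
# status of the totem ring(s) on this node
corosync-cfgtool -s
# corosync log messages since the last boot
journalctl -u corosync -b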
 
Hi,

the error about the missing home dir is not normal.
I think this is the source of your problems.
Did you change any system files?
 
Hi,
we have been running a Proxmox cluster for nearly a year without trouble (including an upgrade from 3.4 to 4.1).
Last August we successfully upgraded the cluster from 4.1 to 4.2-17.
Since then, within two months, we have experienced four incidents very similar to the one described above.
We have 6 nodes configured and up, all with votes, so apparently the cluster is in good shape.
The web interface shows all nodes red except the one we are logged in to.
Furthermore, there is no way to manage the running VMs from the web interface.
We have used the usual CLI commands
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
but the commands hang.
Today we have a new incident with the same symptoms.
Previously we finally decided to reboot the entire cluster to restore GUI control, but the users are very unhappy about stopping work for 1-2 hours while waiting for an orderly reboot.
We suspect the problem is related to the shared filesystem (pmxcfs), but we have not found the root cause or a resolution method.
Can you suggest any ideas to solve the problem?

GUI: all nodes red except the one logged in to; cannot manage the running VMs.

# pveversion
pve-manager/4.2-17/e1400248 (running kernel: 4.4.15-1-pve)

# pvecm status
Quorum information
------------------
Date: Tue Oct 25 17:06:36 2016
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000005
Ring ID: 1/221952
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.110.51
0x00000002 1 192.168.110.52
0x00000003 1 192.168.110.53
0x00000004 1 192.168.110.54
0x00000005 1 192.168.110.55 (local)
0x00000006 1 192.168.110.56

Paul
 
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
but the commands hang.

I have the same problem, and I think it's corosync. For now I use an ugly workaround via Ansible (note the "pkill -9 corosync"):

Code:
---
- hosts: pve
  sudo: yes

  tasks:

    - name: Stop pve-cluster
      service: name=pve-cluster state=stopped

    - name: kill corosync
      shell: pkill -9 corosync

    - name: Restart pve
      service: name={{ item }} state=restarted
      with_items:
        - pve-cluster
        - pvedaemon
        - pvestatd
        - pveproxy
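Assuming the playbook is saved as pve-restart.yml and the pve group exists in the inventory file (both names are just placeholders), it is run with:

Code:
ansible-playbook -i hosts pve-restart.yml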
 
Is there a method to restart corosync?

And if the cluster thinks the server is offline, how do I make it resync? At the moment I have to restart the whole server.
 
Just to let you know.. I have this problem as well.
https://www.dropbox.com/s/vh3uzo15icrjv4k/Skärmklipp 2016-11-22 01.15.00.png?dl=0

Also, all tasks are hanging and backups are not working...

Code:
Package versions
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-94
pve-firmware: 1.1-10
libpve-common-perl: 4.0-80
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-14
pve-qemu-kvm: 2.7.0-6
pve-container: 1.0-81
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
 
# killall -9 corosync
# systemctl restart pve-cluster
Those commands did not bring the server back online in the GUI, but they did sync all data between the nodes.

I can bring the server back with "service pvestatd restart".

The server stays online for a while, then it goes back to offline status.
 
Confirming the issue. Communication was lost - even Nagios checks against snmpd were unresponsive. pvecm nodes/info showed everything was OK. Restarting corosync resynced, pve-cluster was restarted, there were some "endpoint not connected" messages, so I fully restarted the server.

proxmox-ve 4.3.7-1
pve-manager 4.3-10/7230e60f (running kernel: 4.4.21-1-pve)
corosync-pve 2.4.0-1
 
Hello guys, I have the same problem on Proxmox 5 :( Any solutions to fix this? I have a feeling it will happen again and again, and I don't like constantly wondering when it will happen next.
 
Hi guys, any solutions? This problem used to happen weekly, but now it happens every 2 days.
 
root@Marte:~# killall -9 corosync
root@Marte:~# systemctl restart pve-cluster

works for me on the offline node
 
