Proxmox 4.0 - VM locked cluster problem

wildbill442

I'm getting some odd behavior. I have a six-node cluster, and when I log into the Proxmox GUI, the datacenter view only shows one of the six nodes online (whichever one I'm logged into). The others show as red/offline.

I have two VMs that are offline and that I can't start because they're locked by a backup.

When I issue qm unlock <VMID>, the command just hangs and I can't escape out of it. I have to start a new SSH session to get back to a prompt.

I'm assuming there's something wrong with the cluster, but Proxmox itself reports all nodes online; only the GUI shows nothing but the node you're logged into as active.

I've tried rebooting the individual nodes, restarting the pve-cluster service, and making sure the clocks are in sync on all the servers... I'm not sure what else to do, since Proxmox believes the cluster is active.
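For reference, this is roughly what I've been running on each node (stock PVE 4.x service names; the grep just shows the lock line, which lives in the VM config under /etc/pve):

Code:
# cluster / quorum state
pvecm status
systemctl status pve-cluster corosync

# restart the cluster filesystem (one of the things I tried)
systemctl restart pve-cluster

# the backup lock is just a "lock: backup" line in the VM config --
# /etc/pve is backed by pmxcfs, so writes to it hang without quorum
grep ^lock /etc/pve/qemu-server/<VMID>.conf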

I'd post the relevant logs and command output, but since I'm new to this forum it's blocking the content because it thinks it contains URLs... Maybe a moderator can lift this restriction for my account?
 
To add some additional context -- this was a Proxmox 3.x cluster that I upgraded to 4.0 following the wiki instructions. It had been working for about a month, until this morning when this problem arose.
 
pvecm status:

root@c2-node1:~# pvecm status
Quorum information
------------------
Date:             Wed Oct 28 06:22:03 2015
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          2252
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000006          1 10.1.1.10
0x00000005          1 10.1.1.11
0x00000004          1 10.1.1.12
0x00000001          1 10.1.1.13 (local)
0x00000002          1 10.1.1.14
0x00000003          1 10.1.1.15

Here's the output of /var/log/syslog on c2-node1:

Oct 28 06:25:02 c2-node1 pveproxy[18569]: proxy detected vanished client connection
Oct 28 06:25:32 c2-node1 pveproxy[18569]: proxy detected vanished client connection
Oct 28 06:26:03 c2-node1 pveproxy[18569]: proxy detected vanished client connection
Oct 28 06:26:33 c2-node1 systemd[1]: pveproxy.service stopping timed out. Terminating.
Oct 28 06:26:33 c2-node1 pveproxy[1974]: received signal TERM
Oct 28 06:26:33 c2-node1 pveproxy[1974]: server closing
Oct 28 06:26:34 c2-node1 pveproxy[18569]: proxy detected vanished client connection
Oct 28 06:26:34 c2-node1 pveproxy[18569]: worker exit
Oct 28 06:26:34 c2-node1 pveproxy[1974]: worker 18569 finished
Oct 28 06:26:43 c2-node1 pveproxy[1974]: error stopping workers (will kill them now) - timeout
Oct 28 06:26:43 c2-node1 pveproxy[1974]: kill worker 20509
Oct 28 06:26:43 c2-node1 pveproxy[1974]: kill worker 16997
Oct 28 06:26:43 c2-node1 pveproxy[1974]: server stopped
 
Are the other nodes okay?

The log doesn't really help; a bit more information would be useful.

Did you restart pve-cluster, pvestatd, pveproxy, pve-manager? (e.g.: with `systemctl restart pve-cluster` )
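For example (standard unit names on a default install):

Code:
# cluster filesystem first, then the PVE daemons
systemctl restart pve-cluster
systemctl restart pvedaemon pveproxy pvestatd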
 
I'm seeing a lot of cpg_send_message retries on all of the nodes in the cluster. Now I can't even log in to the management GUI, but the VMs are still running, with the exception of those that are locked.

I don't really want to do anything at the moment, as it's during business hours and we have some crucial services on these nodes.

Code:
Oct 28 07:48:20 c2-node1 corosync[1880]:  [MAIN  ] Q empty, queued:0 sent:39263.
Oct 28 07:48:21 c2-node1 pmxcfs[1583]: [status] notice: received sync request (epoch 1/1583/00000015)
Oct 28 07:49:57 c2-node1 pmxcfs[1583]: [dcdb] notice: members: 1/1583, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:49:57 c2-node1 pmxcfs[1583]: [status] notice: members: 1/1583, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:49:57 c2-node1 pmxcfs[1583]: [dcdb] notice: received sync request (epoch 1/1583/00000016)
Oct 28 07:49:57 c2-node1 pmxcfs[1583]: [status] notice: received sync request (epoch 1/1583/00000016)
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [dcdb] notice: members: 1/1583, 2/27857, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [dcdb] notice: queue not emtpy - resening 9 messages
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [dcdb] notice: received sync request (epoch 1/1583/00000017)
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [status] notice: members: 1/1583, 2/27857, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [status] notice: queue not emtpy - resening 41337 messages
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:57 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:58 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 2 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:51:59 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: cpg_send_message retried 1 times
Oct 28 07:52:00 c2-node1 corosync[1880]:  [MAIN  ] Q empty, queued:0 sent:40331.
Oct 28 07:52:00 c2-node1 pmxcfs[1583]: [status] notice: received sync request (epoch 1/1583/00000017)
Oct 28 07:53:37 c2-node1 pmxcfs[1583]: [dcdb] notice: members: 1/1583, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:53:37 c2-node1 pmxcfs[1583]: [status] notice: members: 1/1583, 3/2066, 4/1152, 5/2406, 6/1252
Oct 28 07:53:37 c2-node1 pmxcfs[1583]: [dcdb] notice: received sync request (epoch 1/1583/00000018)
Oct 28 07:53:37 c2-node1 pmxcfs[1583]: [status] notice: received sync request (epoch 1/1583/00000018)
Oct 28 08:05:58 c2-node1 systemd-timesyncd[1364]: interval/delta/delay/jitter/drift 2048s/-0.000s/0.086s/0.006s/-7ppm
Oct 28 08:07:38 c2-node1 rrdcached[1841]: flushing old values
Oct 28 08:07:38 c2-node1 rrdcached[1841]: rotating journals
Oct 28 08:07:38 c2-node1 rrdcached[1841]: started new journal /var/lib/rrdcached/journal/rrd.journal.1446044858.508081
Oct 28 08:07:38 c2-node1 rrdcached[1841]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1446037658.508013
Oct 28 08:17:01 c2-node1 CRON[27436]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 28 08:40:07 c2-node1 systemd-timesyncd[1364]: interval/delta/delay/jitter/drift 2048s/+0.004s/0.086s/0.006s/-7ppm
Oct 28 09:07:38 c2-node1 rrdcached[1841]: flushing old values
Oct 28 09:07:38 c2-node1 rrdcached[1841]: rotating journals
Oct 28 09:07:38 c2-node1 rrdcached[1841]: started new journal /var/lib/rrdcached/journal/rrd.journal.1446048458.508048
Oct 28 09:07:38 c2-node1 rrdcached[1841]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1446041258.508075
Oct 28 09:14:15 c2-node1 systemd-timesyncd[1364]: interval/delta/delay/jitter/drift 2048s/+0.002s/0.088s/0.005s/-7ppm
Oct 28 09:17:02 c2-node1 CRON[19014]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 28 09:48:23 c2-node1 systemd-timesyncd[1364]: interval/delta/delay/jitter/drift 2048s/+0.008s/0.099s/0.005s/-5ppm
 
OK... well, I just powered all the nodes in the cluster off and on again -- I had to physically power some of them down because they were hanging while stopping the pve-* services.

Now everything is back online and operational... I'm not sure what caused this -- it appears it may have stemmed from a backup on October 26 that locked the VMs and was either interrupted or hit an error and didn't complete successfully.
 
It would be a bit strange if the cluster filesystem (and other services) hung because of a failed backup.

The retries may indicate a misbehaving corosync on one node. Restarting (or, if there's no other way, killing) the corosync process should help with that.

Is the network that corosync communicates over fine? No heavy traffic like storage or the like?
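For example, on the node that looks suspicious (standard service names; the kill is only a last resort if the restart itself hangs):

Code:
systemctl restart corosync

# if the restart hangs, as a last resort:
pkill -9 corosync
systemctl start corosync
systemctl restart pve-cluster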
 

None really... This deployment is more of a test bed to replace an old Parallels Virtuozzo environment that's in production, so there's very little traffic, with the exception of nightly backups to a FreeBSD NAS (which replicates to a second NAS) around 1 AM.

This morning the cluster is back to showing offline, however none of the VMs are locked.

Code:
root@c2-node1:~# pvecm status
Quorum information
------------------
Date:             Fri Oct 30 07:49:15 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          3484
Quorate:          No

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      1
Quorum:           4 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.1.13 (local)

Code:
root@c2-node1:/var/log# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: active (running) since Fri 2015-10-30 08:01:33 PDT; 11min ago
  Process: 1608 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 1588 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 1606 (pmxcfs)
   CGroup: /system.slice/pve-cluster.service
           └─1606 /usr/bin/pmxcfs

Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [dcdb] notice: start sending inode updates
Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [dcdb] notice: sent all (10) updates
Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [dcdb] notice: all data is up to date
Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [status] notice: received sync request (epoch 1/1606/00000004)
Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [status] notice: received all states
Oct 30 08:03:01 c2-node1 pmxcfs[1606]: [status] notice: all data is up to date
Oct 30 08:03:02 c2-node1 pmxcfs[1606]: [status] notice: received log
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [dcdb] notice: members: 1/1606
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [status] notice: members: 1/1606
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [status] notice: node lost quorum

After rebooting all the nodes in the cluster the quorum is formed and then quickly lost:

Code:
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2b7 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2cc 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b2 2b3 2b4 2b5 2ba
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b2 2b3 2b4 2b5 2b7 2ba 2bb
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2b2 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2cc 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b3 2b4 2b5 2b7 2ba
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2b7 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2cc 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b2 2b3 2b4 2b5 2ba
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b2 2b3 2b4 2b5 2b7 2ba 2bb
Oct 30 08:03:16 c2-node1 corosync[1929]:  [TOTEM ] Retransmit List: 2b2 2c0 2c1 2c2 2c3 2c4 2c5 2c6 2c7 2c8 2c9 2ca 2cb 2cc 2b6 2b8 2b9 2a9 2aa 2ab 2ac 2ad 2ae 2af 2b0 2b3 2b4 2b5 2b7 2ba
Oct 30 08:03:20 c2-node1 corosync[1929]:  [TOTEM ] A processor failed, forming new configuration.
Oct 30 08:03:24 c2-node1 corosync[1929]:  [TOTEM ] A new membership (10.1.1.13:11048) was formed. Members left: 6 5 4 2 3
Oct 30 08:03:24 c2-node1 corosync[1929]:  [TOTEM ] Failed to receive the leave message. failed: 6 5 4 2 3
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [dcdb] notice: members: 1/1606
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [status] notice: members: 1/1606
Oct 30 08:03:24 c2-node1 pmxcfs[1606]: [status] notice: node lost quorum
Oct 30 08:03:24 c2-node1 corosync[1929]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 30 08:03:24 c2-node1 corosync[1929]:  [QUORUM] Members[1]: 1
Oct 30 08:03:24 c2-node1 corosync[1929]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 30 08:03:24 c2-node1 pve-ha-lrm[2011]: unable to write lrm status file - closing file '/etc/pve/nodes/c2-node1/lrm_status.tmp.2011' failed - Permission denied
Oct 30 08:03:25 c2-node1 pvedaemon[2390]: start VM 109: UPID:c2-node1:00000956:000032A4:563386BD:qmstart:109:root@pam:
Oct 30 08:03:25 c2-node1 pvedaemon[1996]: <root@pam> starting task UPID:c2-node1:00000956:000032A4:563386BD:qmstart:109:root@pam:
Oct 30 08:03:26 c2-node1 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Oct 30 08:03:26 c2-node1 kernel: [  131.445871] device tap109i0 entered promiscuous mode
Oct 30 08:03:26 c2-node1 kernel: [  131.451227] vmbr0: port 2(tap109i0) entered forwarding state
Oct 30 08:03:26 c2-node1 kernel: [  131.451236] vmbr0: port 2(tap109i0) entered forwarding state
Oct 30 08:03:28 c2-node1 pvedaemon[2423]: start VM 110: UPID:c2-node1:00000977:000033B2:563386C0:qmstart:110:root@pam:
Oct 30 08:03:28 c2-node1 pvedaemon[1995]: <root@pam> starting task UPID:c2-node1:00000977:000033B2:563386C0:qmstart:110:root@pam:
Oct 30 08:03:28 c2-node1 kernel: [  133.266477] device tap110i0 entered promiscuous mode
Oct 30 08:03:28 c2-node1 kernel: [  133.271719] vmbr0: port 3(tap110i0) entered forwarding state
Oct 30 08:03:28 c2-node1 kernel: [  133.271728] vmbr0: port 3(tap110i0) entered forwarding state
Oct 30 08:03:29 c2-node1 kernel: [  134.359021] kvm: zapping shadow pages for mmio generation wraparound
Oct 30 08:03:29 c2-node1 kernel: [  134.359637] kvm: zapping shadow pages for mmio generation wraparound
Oct 30 08:03:30 c2-node1 pvedaemon[2457]: start VM 111: UPID:c2-node1:00000999:0000347C:563386C2:qmstart:111:root@pam:
Oct 30 08:03:30 c2-node1 pvedaemon[1994]: <root@pam> starting task UPID:c2-node1:00000999:0000347C:563386C2:qmstart:111:root@pam:
Oct 30 08:03:30 c2-node1 kernel: [  135.308871] device tap111i0 entered promiscuous mode
Oct 30 08:03:30 c2-node1 kernel: [  135.313814] vmbr0: port 4(tap111i0) entered forwarding state
Oct 30 08:03:30 c2-node1 kernel: [  135.313823] vmbr0: port 4(tap111i0) entered forwarding state
 
So, since it's a test bed: do you run the test cluster on top of another virtualization technology, or on bare metal? There are known problems on some hypervisors, especially with multicast communication, AFAIK.

Is it only that node? Does restarting the corosync service do nothing?

Honestly, it's a bit hard to tell what the cause of this could be. If it's only one node, a re-installation could be done (and since it's a test bed, that should be possible without bigger annoyances).
 

It's six bare-metal servers, all running Proxmox 4.0 with the latest packages from http://download.proxmox.com/debian jessie pve-no-subscription.

It's a test bed with a few servers running on it, mainly a management/monitoring server for our network and a Windows server for QuickBooks and basic file sharing. The QuickBooks server limits my ability to do things during business hours.

All nodes show only the local node as up... It's like the multicast packets aren't reaching the nodes, although omping shows otherwise. And why would the quorum be established for a short duration and then be lost? I wish there was more information in the logs...
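For reference, this is roughly the omping test I ran, started in parallel on all nodes (using the cluster IPs from above):

Code:
# quick multicast check between the nodes
omping -c 10000 -i 0.001 -q 10.1.1.10 10.1.1.11 10.1.1.12 10.1.1.13 10.1.1.14 10.1.1.15

# longer ~10 minute run -- an IGMP snooping/querier timeout typically only
# shows up after several minutes, which would match "quorum forms, then drops"
omping -c 600 -i 1 -q 10.1.1.10 10.1.1.11 10.1.1.12 10.1.1.13 10.1.1.14 10.1.1.15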
 
I may just end up formatting and reinstalling Proxmox 4.x from scratch... This cluster was upgraded from 3.x and had been working for a few weeks up until now. I'm not sure what the catalyst for these issues was, but at this point it may make more sense to start with a fresh install.
 
Because the multicast querier is not active/working?

OK, I feel stupid... It was definitely a multicast issue. It looks like the PIM package on my router was removed, either through an update or because someone thought it was unnecessary. I re-added the package to the router and enabled PIM on the interface facing my server LAN, and the quorum was immediately re-established.

I know the router and switch were both upgraded recently (a MikroTik CCR1016-12G and a CRS226-24G-2S+, respectively) to RouterOS v6.30.4. The tech that did the upgrade probably didn't upload the multicast package.
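In case anyone else runs into this, the MikroTik side needed roughly the following (RouterOS 6.x with the multicast package; I'm writing the menu path from memory, and "ether12" is just an example name for the interface facing the server LAN, so double-check against your own config):

Code:
/system package print                          # the "multicast" package must be installed and enabled
/routing pim interface add interface=ether12   # enable PIM on the server-LAN interface (example name)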

Anyway, this is resolved and I didn't have to rebuild the cluster.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!