Problem with cluster after last update

hape

Renowned Member
Jun 10, 2013
Hello all,

After the last update on all of our cluster nodes (done last night), I can no longer see any of the other nodes except the one I am connected to directly.

At the same time I updated some RAID controller firmware and software. I copied the installation sources from one node to three others via scp. I got the message that the host keys differ from the stored ones, and I updated them with ssh-keygen -R ....

On the first node I can see that only the first and the second node are registered in the cluster.

How can I restore the cluster configuration now?

Any ideas?
 

When I try to restart with "/etc/init.d/cman start" I get the following:

-----
root@virtfarm-g1:/etc/pve/priv# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
-----

I don't understand what has happened.
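
The quorum and membership state can be checked with the standard PVE tools; a minimal check on the affected node looks something like this:

-----
# overall cluster/quorum state as Proxmox VE sees it
pvecm status

# list the nodes currently known to the cluster
pvecm nodes
-----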
 
I have now seen that last night's backups also failed.

What is wrong here?

The log of the failed backups:
-----
118: Sep 29 02:00:01 INFO: Starting Backup of VM 118 (qemu)
118: Sep 29 02:00:01 INFO: status = running
118: Sep 29 02:00:02 INFO: unable to open file '/etc/pve/nodes/virtfarm-g1/qemu-server/118.conf.tmp.22297' - Permission denied
118: Sep 29 02:00:02 INFO: update VM 118: -lock backup
118: Sep 29 02:00:02 ERROR: Backup of VM 118 failed - command 'qm set 118 --lock backup' failed: exit code 2

120: Sep 29 02:00:02 INFO: Starting Backup of VM 120 (qemu)
120: Sep 29 02:00:02 INFO: status = running
120: Sep 29 02:00:02 INFO: unable to open file '/etc/pve/nodes/virtfarm-g1/qemu-server/120.conf.tmp.22302' - Permission denied
120: Sep 29 02:00:02 INFO: update VM 120: -lock backup
120: Sep 29 02:00:02 ERROR: Backup of VM 120 failed - command 'qm set 120 --lock backup' failed: exit code 2

121: Sep 29 02:00:02 INFO: Starting Backup of VM 121 (qemu)
121: Sep 29 02:00:02 INFO: status = running
121: Sep 29 02:00:03 INFO: unable to open file '/etc/pve/nodes/virtfarm-g1/qemu-server/121.conf.tmp.22307' - Permission denied
121: Sep 29 02:00:03 INFO: update VM 121: -lock backup
121: Sep 29 02:00:03 ERROR: Backup of VM 121 failed - command 'qm set 121 --lock backup' failed: exit code 2

122: Sep 29 02:00:03 INFO: Starting Backup of VM 122 (qemu)
122: Sep 29 02:00:03 INFO: status = running
122: Sep 29 02:00:03 INFO: unable to open file '/etc/pve/nodes/virtfarm-g1/qemu-server/122.conf.tmp.22314' - Permission denied
122: Sep 29 02:00:03 INFO: update VM 122: -lock backup
122: Sep 29 02:00:03 ERROR: Backup of VM 122 failed - command 'qm set 122 --lock backup' failed: exit code 2

-----
 
118: Sep 29 02:00:02 INFO: unable to open file '/etc/pve/nodes/virtfarm-g1/qemu-server/118.conf.tmp.22297' - Permission denied

You can only write to /etc/pve when you have quorum, and since that is not the case, the lock mechanism needed for backups fails. The missing quorum is also what causes the 'lost'/invisible nodes.

First, did you do a clean restart on all nodes?
Can you ssh between the nodes without a password? That is needed for the cluster.
Note that all your nodes share the authorized_keys file, since /root/.ssh/authorized_keys is a symlink into the cluster filesystem at /etc/pve/priv/authorized_keys.
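
Both points can be verified quickly; a sketch (replace the node name with one of your own):

-----
# the file should be a symlink into the cluster filesystem (pmxcfs)
ls -l /root/.ssh/authorized_keys

# key-based root login must work without any password prompt
ssh -o BatchMode=yes root@<other-node> true && echo "key auth OK"
-----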

At the same time I updated some RAID controller firmware and software. I copied the installation sources from one node to three others via scp. I got the message that the host keys differ from the stored ones, and I updated them with ssh-keygen -R ....
Why did the host keys differ? That sounds strange to me; maybe I'm overlooking something.
 
Yesterday I ran a lot of checks. When I restart the "cman" and "pve-cluster" services on some of the 6 PVE nodes, the whole cluster comes up for 5-15 minutes.
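
Restarting both services with the stock init scripts looks roughly like this (the stop/start order shown is just one sensible sequence, not the only possible one):

-----
# stop the cluster filesystem first, then the cluster manager,
# and bring both back up in reverse order
/etc/init.d/pve-cluster stop
/etc/init.d/cman stop
/etc/init.d/cman start
/etc/init.d/pve-cluster start
-----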

On one of the failing hosts I later get the following log entries:

--
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
Sep 30 06:25:10 virtfarm-g3 pmxcfs[47236]: [status] crit: cpg_send_message failed: 9
--

This morning I restarted the "cman" and "pve-cluster" services on that host, and it joined the cluster again. But now I get the following log entries:

--
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 478 479 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 484 485 489 47d 47e 47f 480 481 482 483 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 478 479 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 484 485 489 47d 47e 47f 480 481 482 483 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 478 479 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
Sep 30 08:44:10 virtfarm-g3 corosync[121341]: [TOTEM ] Retransmit List: 47a 47b 47c 489 47d 47e 47f 480 481 482 483 484 485 486 487 488 48a
--

And a few minutes later the host is lost from the cluster again.
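
Corosync retransmit lists like these usually point at lossy or blocked multicast between the nodes; a common way to test this is omping, started in parallel on all nodes (the hostnames below are placeholders):

-----
# run the same command on every node at roughly the same time;
# each node should report ~0% loss for both unicast and multicast
omping -c 10000 -i 0.001 -F -q node1 node2 node3
-----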
 
Hello all,

I have now installed 2 new nodes as a completely new cluster, step by step and definitely without any mistakes. In the new cluster I get the same error as in the old one. Before the updates the cluster runs normally. As soon as I install all updates from the subscription repository and from Debian, I get the same result. When I restart the two nodes, the cluster comes up for a couple of minutes.

The "Retransmit List:" messages are coming 1-2 minutes before the last message and the drop of the cluster.

After that I get the following messages from the first node:

--------
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 295 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 294
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 293 294 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 295
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 295 28e 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 294
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 294 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 295
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 293 295 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 294
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 294 28e 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 295
Oct 6 14:03:13 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 295 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 294
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 293 294 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 295
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 295 28e 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 294
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 294 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 295
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 293 295 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 294
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 294 28e 28f 298 299 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 295
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] Retransmit List: 295 296 297 29a 27a 27b 27c 27d 27e 27f 285 286 287 288 289 28a 290 291 292 293 294
Oct 6 14:03:14 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] FAILED TO RECEIVE
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] New Configuration:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] #011r(0) ip(172.16.8.50)
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] Members Left:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] #011r(0) ip(172.16.8.51)
Oct 6 14:03:16 virtfarm-stpgp-1 pmxcfs[3721]: [status] notice: node lost quorum
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] Members Joined:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CMAN ] quorum lost, blocking activity
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [QUORUM] Members[1]: 1
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] New Configuration:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] #011r(0) ip(172.16.8.50)
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] Members Left:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CLM ] Members Joined:
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 6 14:03:16 virtfarm-stpgp-1 kernel: dlm: closing connection to node 2
Oct 6 14:03:16 virtfarm-stpgp-1 pmxcfs[3721]: [dcdb] notice: members: 1/3721
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [CPG ] chosen downlist: sender r(0) ip(172.16.8.50) ; members(old:2 left:1)
Oct 6 14:03:16 virtfarm-stpgp-1 pmxcfs[3721]: [dcdb] notice: members: 1/3721
Oct 6 14:03:16 virtfarm-stpgp-1 corosync[3849]: [MAIN ] Completed service synchronization, ready to provide service.
--------

and the following on the second node:

--------
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 296 297 29a 293 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 28f 298 299
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28f 298 299 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 296 297 29a
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28e 296 297 29a 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28f 298 299
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28f 298 299 293 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 296 297 29a
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 296 297 29a 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 28f 298 299
Oct 6 14:03:13 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28e 28f 298 299 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 296 297 29a
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 296 297 29a 293 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 28f 298 299
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28f 298 299 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 296 297 29a
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28e 296 297 29a 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28f 298 299
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28f 298 299 293 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 296 297 29a
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 296 297 29a 294 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 28e 28f 298 299
Oct 6 14:03:14 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] Retransmit List: 28e 28f 298 299 295 275 276 277 278 279 280 281 282 283 284 28b 28c 28d 296 297 29a
Oct 6 14:03:24 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] A processor failed, forming new configuration.
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] New Configuration:
Oct 6 14:03:26 virtfarm-stpgp-2 pmxcfs[3085]: [status] notice: node lost quorum
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] #011r(0) ip(172.16.8.51)
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Left:
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] #011r(0) ip(172.16.8.50)
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Joined:
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CMAN ] quorum lost, blocking activity
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [QUORUM] Members[1]: 2
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] New Configuration:
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] #011r(0) ip(172.16.8.51)
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Left:
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Joined:
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [CPG ] chosen downlist: sender r(0) ip(172.16.8.51) ; members(old:2 left:1)
Oct 6 14:03:26 virtfarm-stpgp-2 pmxcfs[3085]: [dcdb] notice: members: 2/3085
Oct 6 14:03:26 virtfarm-stpgp-2 corosync[3250]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 6 14:03:26 virtfarm-stpgp-2 pmxcfs[3085]: [dcdb] notice: members: 2/3085
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] New Configuration:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] #011r(0) ip(172.16.8.51)
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Left:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Joined:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] CLM CONFIGURATION CHANGE
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] New Configuration:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] #011r(0) ip(172.16.8.51)
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Left:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CLM ] Members Joined:
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [CPG ] chosen downlist: sender r(0) ip(172.16.8.51) ; members(old:1 left:0)
Oct 6 14:03:30 virtfarm-stpgp-2 corosync[3250]: [MAIN ] Completed service synchronization, ready to provide service.
--------

I'm completely confused. What can I do? Should I cancel the subscriptions?

Can anyone at Proxmox give us professional support for this problem?

Regards

Hans-Peter
 
Hello all,

I think I have now solved the problem myself.

After the last update, in 3.4 and also in 4.0, corosync communication only works properly if multicast traffic can pass through via the standard gateway. The cluster was lost after the update and only works again when I install a fresh 3.4 system without any subscription updates.

After I enabled multicast communication on the switch/router in the local network, the cluster runs with every newer version.

We are using MikroTik switches/routers in our network. We had to install the corresponding (multicast) package on the router and set up an IGMP querier interface. After that all the cluster functionality works fine.
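
To confirm the fix holds beyond the IGMP snooping timeout, a longer, slower omping run can be used after changing the switch/router configuration (hostnames are placeholders again):

-----
# about 10 minutes of multicast traffic; loss that appears only after
# several minutes typically indicates an IGMP snooping/querier problem
omping -c 600 -i 1 -q node1 node2 node3
-----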

It would be nice to add extended information about this multicast issue to the installation howtos.

Regards

Hans-Peter Straub
 
