Hello,
We had a node in our lab cluster go down during the night, and I am trying to determine why.
There was an automated backup running at the time; it has never failed before, and it did not fail on the other nodes.
VM 124 was not the VM being backed up; that was VM 114.
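For context, the backup is the normal scheduled vzdump job set up through the GUI. I don't have the exact entry in front of me, but it is roughly of this shape (the schedule, VM ID, storage name and options below are placeholders, not our real values):

# /etc/cron.d/vzdump - generated by the Proxmox scheduler; values are placeholders
PATH="/usr/sbin:/usr/bin:/sbin:/bin"
0 1 * * * root vzdump 114 --quiet 1 --mode snapshot --compress lzo --storage backup_store --mailto root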
The possibly relevant lines from syslog for that period are:
Sep 2 01:16:07 virt2 rrdcached[3449]: flushing old values
Sep 2 01:16:07 virt2 rrdcached[3449]: rotating journals
Sep 2 01:16:07 virt2 rrdcached[3449]: started new journal /var/lib/rrdcached/journal/rrd.journal.1409613367.170428
Sep 2 01:16:07 virt2 rrdcached[3449]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1409606167.170378
Sep 2 01:17:01 virt2 /USR/SBIN/CRON[434497]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 2 01:17:37 virt2 pmxcfs[81604]: [status] notice: received log
Sep 2 01:17:58 virt2 pvedaemon[427752]: WARNING: unable to connect to VM 124 socket - timeout after 31 retries
Sep 2 01:18:03 virt2 pvestatd[81678]: WARNING: unable to connect to VM 124 socket - timeout after 31 retries
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] CLM CONFIGURATION CHANGE
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] New Configuration:
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.2)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] Members Left:
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.1)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.3)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] Members Joined:
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[2]: 2 3
Sep 2 01:18:05 virt2 corosync[3690]: [CMAN ] quorum lost, blocking activity
Sep 2 01:18:05 virt2 pmxcfs[81604]: [status] notice: node lost quorum
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[1]: 2
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] CLM CONFIGURATION CHANGE
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] New Configuration:
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.1)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.2)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.3)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] Members Left:
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] Members Joined:
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.1)
Sep 2 01:18:05 virt2 corosync[3690]: [CLM ] #011r(0) ip(192.168.#.3)
Sep 2 01:18:05 virt2 rgmanager[6749]: #1: Quorum Dissolved
Sep 2 01:18:05 virt2 corosync[3690]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 2 01:18:05 virt2 kernel: dlm: closing connection to node 1
Sep 2 01:18:05 virt2 kernel: dlm: closing connection to node 3
Sep 2 01:18:05 virt2 corosync[3690]: [CMAN ] quorum regained, resuming activity
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] This node is within the primary component and will provide service.
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[2]: 2 3
Sep 2 01:18:05 virt2 pmxcfs[81604]: [status] notice: node has quorum
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[2]: 2 3
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[3]: 1 2 3
Sep 2 01:18:05 virt2 corosync[3690]: [QUORUM] Members[3]: 1 2 3
Sep 2 01:18:05 virt2 corosync[3690]: [CPG ] chosen downlist: sender r(0) ip(192.168.#.1) ; members(old:2 left:0)
Sep 2 01:18:05 virt2 corosync[3690]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 01:18:05 virt2 pvestatd[81678]: status update time (5.565 seconds)
Sep 2 01:18:06 virt2 pvevm: <root@pam> starting task UPID:virt2:0006A1EB:06652AAF:5404FEAE:qmshutdown:100:root@pam:
Sep 2 01:18:06 virt2 task UPID:virt2:0006A1EB:06652AAF:5404FEAE:qmshutdown:100:root@pam:: shutdown VM 100: UPID:virt2:0006A1EB:06652AAF:5404FEAE:qmshutdown:100:root@pam:
Sep 2 01:18:07 virt2 rgmanager[434668]: [pvevm] Task still active, waiting
So, any ideas?