Node red after pmxcfs error

stef1777

Hi Folks!

I have had 2 nodes running PVE 2.3 for a few days. All was fine until this morning. One of the nodes turned red (the second one installed).

I've found plenty of these in syslog on both nodes.

Mar 12 15:38:53 aweb-vs003 pmxcfs[8352]: [status] crit: cpg_send_message failed: 9

I restarted cman and rebooted the red node. I restarted cman on the second node too. No change. Still red.

Since the reboot, I no longer get the pmxcfs error, but the node stays red.

Any idea?


Here are the versions.

pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-18-pve
proxmox-ve-2.6.32: 2.3-88
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-48
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1
 
Hi Dietmar!

No change. I restarted both pvestatd and cman on both nodes.

/etc/pve/.members is not OK. The version is not the same on the two nodes.

I haven't restarted the "first" node as it hosts CTs.
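
For anyone who wants to compare /etc/pve/.members between two nodes, here is a minimal Python sketch. It assumes the file is JSON with top-level "nodename", "version" and "nodelist" keys (an assumption, check your own file first). The function names are made up for illustration, and the sketch only reports differences; it does not decide whether a differing version is actually a problem.

```python
import json


def members_summary(text):
    """Pull out the fields worth comparing from one node's .members content."""
    data = json.loads(text)
    nodelist = data.get("nodelist", {})
    return {
        "nodename": data.get("nodename"),
        "version": data.get("version"),
        # Names of the nodes this node currently believes are online.
        "online": sorted(n for n, info in nodelist.items() if info.get("online")),
    }


def compare_members(text_a, text_b):
    """Return a list of human-readable differences (empty if both views agree)."""
    a, b = members_summary(text_a), members_summary(text_b)
    problems = []
    if a["version"] != b["version"]:
        problems.append(
            "version differs: %s=%s, %s=%s"
            % (a["nodename"], a["version"], b["nodename"], b["version"])
        )
    if a["online"] != b["online"]:
        problems.append(
            "online list differs: %s sees %s, %s sees %s"
            % (a["nodename"], a["online"], b["nodename"], b["online"])
        )
    return problems
```

To use it, copy the content of /etc/pve/.members from each node into a string and pass both strings to compare_members().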
 
I had the same issue. Running "/etc/init.d/pve-cluster restart" fixed it for me.

This has happened twice since upgrading to 2.3.
 
Thanks a lot. I tried and it worked. All nodes are now green.


I checked a bit more before running /etc/init.d/pve-cluster restart.

The first node still logs pmxcfs error 9. This node has not been rebooted yet as it hosts CTs.

cman does not log any error on stop and start, but I still find pmxcfs error messages in syslog for this node.

After restarting pve-cluster on this node, I found this in syslog. pve-cluster takes a long time to restart.

Mar 12 17:04:28 aweb-vs004 pmxcfs[3688]: [main] notice: teardown filesystem
Mar 12 17:04:29 aweb-vs004 pvedaemon[213257]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Mar 12 17:04:29 aweb-vs004 pvedaemon[213257]: WARNING: ipcc_send_rec failed: Connection refused

I have another cluster on 2.2 and have never had this kind of problem there.

- - - Updated - - -

Sorry Dietmar. Too late. I've done a pve-cluster restart on both nodes.

See my previous response. But if you need more details from my logs, don't hesitate.
 
Crashed again this morning.

Restarted pve-cluster, pvedaemon, pvestatd. Back to normal.
 
Hi Spirit!

No, I was busy yesterday. I discovered the system crashed again this morning. It may be related to the backup. I will be out this morning. I'll try to look into it more later today.

Here is part of the syslog from the moment things start to break.



Mar 13 22:10:03 aweb-vs004 rrdcached[1364]: flushing old values
Mar 13 22:10:03 aweb-vs004 rrdcached[1364]: rotating journals
Mar 13 22:10:03 aweb-vs004 rrdcached[1364]: started new journal /var/lib/rrdcached/journal//rrd.journal.1363209003.512558
Mar 13 22:10:03 aweb-vs004 rrdcached[1364]: removing old journal /var/lib/rrdcached/journal//rrd.journal.1363201803.512520
Mar 13 22:11:50 aweb-vs004 pvestatd[537036]: status update time (40.931 seconds)
Mar 13 22:12:02 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:12:32 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:12:42 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:13:22 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:14:28 aweb-vs004 pvestatd[537036]: status update time (7.891 seconds)
Mar 13 22:14:52 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:17:01 aweb-vs004 /USR/SBIN/CRON[812109]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 13 22:18:16 aweb-vs004 pvestatd[537036]: status update time (6.541 seconds)
Mar 13 22:18:52 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:19:02 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:20:50 aweb-vs004 pvedaemon[536993]: <root@pam> successful auth for user 'root@pam'
Mar 13 22:24:02 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:24:13 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:24:22 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:24:57 aweb-vs004 pvestatd[537036]: status update time (27.383 seconds)
Mar 13 22:24:59 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:25:10 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:25:19 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:25:29 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:26:51 aweb-vs004 pvestatd[537036]: status update time (73.925 seconds)
Mar 13 22:27:56 aweb-vs004 pvestatd[537036]: status update time (65.352 seconds)
Mar 13 22:28:02 aweb-vs004 pvestatd[537036]: status update time (5.716 seconds)
Mar 13 22:28:18 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:28:29 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:28:38 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:28:48 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:28:58 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:29:16 aweb-vs004 pvestatd[537036]: status update time (10.056 seconds)
Mar 13 22:29:18 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:29:29 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:29:38 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:29:48 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:29:58 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:30:28 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:30:58 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:31:08 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 13 22:31:26 aweb-vs004 corosync[239953]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:31:29 aweb-vs004 corosync[239953]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 13 22:31:36 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e f 10 16 17
Mar 13 22:31:36 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e f
Mar 13 22:31:37 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:31:38 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:31:39 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:31:40 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:31:41 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:31:42 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:31:43 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:31:44 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:31:45 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:31:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:31:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:31:46 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:31:46 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e f 19
Mar 13 22:31:46 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: d e f 19
Mar 13 22:31:47 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:31:48 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:31:49 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:31:50 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:31:51 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:31:52 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:31:53 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:31:54 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:31:55 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:31:55 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: d e f 19
Mar 13 22:31:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:31:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:31:56 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:31:57 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:31:58 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:31:58 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e f 19
Mar 13 22:31:59 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:00 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:00 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e f 19
Mar 13 22:32:01 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:02 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:03 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e f 19
Mar 13 22:32:03 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:04 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:05 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:06 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:06 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:06 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:07 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:08 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:09 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:10 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:11 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:12 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:13 aweb-vs004 corosync[239953]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 22:32:13 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:14 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:15 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:16 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:16 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:16 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:17 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:18 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:19 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:20 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:21 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:22 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:23 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:24 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:25 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:25 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: b c d e 12 13 14 15
Mar 13 22:32:25 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e 12 13
Mar 13 22:32:26 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:26 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:26 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:27 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:28 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:29 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:30 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:30 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e 12 13
Mar 13 22:32:30 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e 12 13
Mar 13 22:32:31 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:32 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:33 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: c d e 12 13
Mar 13 22:32:33 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:34 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:35 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:37 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:37 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: d e 12 13
Mar 13 22:32:38 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:39 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:40 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:41 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:42 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:43 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:44 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:45 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:45 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:47 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:48 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:49 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:50 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:51 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:52 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:52 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:53 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:54 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:54 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:55 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CPG ] chosen downlist: sender r(0) ip(62.210.166.17) ; members(old:2 left:0)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 13 22:33:32 aweb-vs004 pvestatd[537036]: status update time (116.680 seconds)
Mar 13 22:33:34 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
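
A log this long is easier to read as a timeline. Here is a rough Python sketch that counts, per minute, the pvestatd storage timeouts and the corosync "A processor failed" events, and tallies the cpg_send_message error codes. The symbolic names for codes 5, 6 and 9 are from my recollection of corosync's cs_error_t enum (corotypes.h), so verify them against the header for your corosync version. The function name and the event labels are made up for illustration.

```python
import re
from collections import Counter

# Assumed mapping of cs_error_t values; check corotypes.h for your version.
CS_ERRORS = {5: "CS_ERR_TIMEOUT", 6: "CS_ERR_TRY_AGAIN", 9: "CS_ERR_BAD_HANDLE"}

# Captures "Mon DD HH:MM", the daemon name, and the message text.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d):\d\d \S+ ([^\[:\s]+)\[\d+\]: (.*)$")
CPG_FAIL_RE = re.compile(r"cpg_send_message failed: (\d+)")


def summarize(lines):
    """Return (events per minute, cpg failure codes) as two Counters."""
    per_minute = Counter()
    cpg_codes = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a classic syslog line
        minute, daemon, msg = match.groups()
        if daemon == "pvestatd" and "got timeout" in msg:
            per_minute[(minute, "storage timeout")] += 1
        elif daemon == "corosync" and "A processor failed" in msg:
            per_minute[(minute, "membership change")] += 1
        fail = CPG_FAIL_RE.search(msg)
        if fail:
            code = int(fail.group(1))
            cpg_codes[CS_ERRORS.get(code, "unknown (%d)" % code)] += 1
    return per_minute, cpg_codes
```

Feeding it the lines of /var/log/syslog (for example via open(...).read().splitlines()) and printing sorted(per_minute.items()) shows whether the storage timeouts start before the corosync membership changes, as they do in the excerpt above.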


I tried the command 'df -P -B 1 /mnt/pve/Backup-VZDump' manually. No problem. It's an NFS mount.
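
A single manual run outside the backup window may not reproduce the timeout. Here is a minimal Python sketch for probing the mount repeatedly while a backup is running. The 2 second limit is a placeholder, not the value pvestatd uses, and probe() is a made-up helper name.

```python
import subprocess
import time


def probe(cmd, timeout_s=2.0):
    """Run cmd once and return (status, elapsed_seconds).

    status is "ok", "error" (non-zero exit) or "timeout".
    Caveat: on timeout the child is killed and then waited for, so if it is
    stuck in uninterruptible sleep on a hung NFS mount this call can itself
    block until the mount recovers.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - start


# Example call, using the same command pvestatd logs:
# probe(["df", "-P", "-B", "1", "/mnt/pve/Backup-VZDump"])
```

Calling it in a loop with a short sleep between runs, and logging the status with a timestamp, would show whether the slow responses line up with the backup job.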

Found this in dmesg:

INFO: task lzop:990183 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lzop D ffff88010d6fc680 0 990183 990179 0 0x00000000
ffff8801473ff9c8 0000000000000082 ffff8801473ff990 0000000000000002
0000000000000003 00000000000003c8 ffff880028250610 ffff880200000000
ffff880028250600 000000012830c448 ffff88010d6fcc48 000000000001e9c0
Call Trace:
[<ffffffff81123d10>] ? sync_page+0x0/0x50
[<ffffffff8151cd93>] io_schedule+0x73/0xc0
[<ffffffff81123d4d>] sync_page+0x3d/0x50
[<ffffffff8151d75f>] __wait_on_bit+0x5f/0x90
[<ffffffff81123f83>] wait_on_page_bit+0x73/0x80
[<ffffffff81096c00>] ? wake_bit_function+0x0/0x40
[<ffffffff81126280>] grab_cache_page_write_begin+0xc0/0x150
[<ffffffffa0652477>] nfs_write_begin+0x77/0x230 [nfs]
[<ffffffff811248bb>] generic_file_buffered_write_iter+0x10b/0x2b0
[<ffffffff81126702>] __generic_file_write_iter+0x1a2/0x3b0
[<ffffffff81126995>] __generic_file_aio_write+0x85/0xa0
[<ffffffff81126a38>] generic_file_aio_write+0x88/0x100
[<ffffffffa06531fc>] nfs_file_write+0x10c/0x210 [nfs]
[<ffffffffa06530f1>] ? nfs_file_write+0x1/0x210 [nfs]
[<ffffffff81197c9a>] do_sync_write+0xfa/0x140
[<ffffffff81096bc0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
[<ffffffff81197f78>] vfs_write+0xb8/0x1a0
[<ffffffff8100ba8e>] ? common_interrupt+0xe/0x13
[<ffffffff81198871>] sys_write+0x51/0x90
[<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
 
Mar 13 22:32:35 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:36 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:37 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:37 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: d e 12 13
Mar 13 22:32:38 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:39 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:40 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:41 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:42 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:43 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:44 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:45 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:45 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:46 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:47 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 10
Mar 13 22:32:48 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 20
Mar 13 22:32:49 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 30
Mar 13 22:32:50 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 40
Mar 13 22:32:51 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 50
Mar 13 22:32:52 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:52 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 60
Mar 13 22:32:53 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 70
Mar 13 22:32:54 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 80
Mar 13 22:32:54 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:55 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 90
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retry 100
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] notice: cpg_send_message retried 100 times
Mar 13 22:32:56 aweb-vs004 pmxcfs[534005]: [status] crit: cpg_send_message failed: 6
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [TOTEM ] Retransmit List: e 12 13
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] CLM CONFIGURATION CHANGE
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] New Configuration:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.17)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] #011r(0) ip(62.210.166.24)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Left:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CLM ] Members Joined:
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [CPG ] chosen downlist: sender r(0) ip(62.210.166.17) ; members(old:2 left:0)
Mar 13 22:32:57 aweb-vs004 corosync[239953]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 13 22:33:32 aweb-vs004 pvestatd[537036]: status update time (116.680 seconds)
Mar 13 22:33:34 aweb-vs004 pvestatd[537036]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
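
The excerpt above repeats one pattern: pmxcfs retries `cpg_send_message` up to 100 times, logs a `crit` failure, and corosync interleaves retransmit requests until a membership change. A minimal Python sketch for tallying those events from a saved syslog follows. The function name `summarize` and the regular expressions are my own assumptions, derived only from the message formats visible in this thread, not from any Proxmox tool.

```python
import re
from collections import Counter

# Patterns follow the pmxcfs/corosync message formats seen in the log above.
CPG_FAIL = re.compile(r"pmxcfs\[\d+\]: \[status\] crit: cpg_send_message failed: (\d+)")
RETRANSMIT = re.compile(r"corosync\[\d+\]: \[TOTEM \] Retransmit List: ")
PROC_FAILED = re.compile(r"\[TOTEM \] A processor failed, forming new configuration")


def summarize(lines):
    """Tally cpg failures per error code, retransmit lines and membership losses."""
    summary = {
        "cpg_failures": Counter(),   # error code -> number of crit lines
        "retransmit_lines": 0,       # number of 'Retransmit List' lines
        "processor_failed": 0,       # number of 'A processor failed' lines
    }
    for line in lines:
        match = CPG_FAIL.search(line)
        if match:
            summary["cpg_failures"][int(match.group(1))] += 1
        elif RETRANSMIT.search(line):
            summary["retransmit_lines"] += 1
        elif PROC_FAILED.search(line):
            summary["processor_failed"] += 1
    return summary
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of the syslog, and compare the counts between the two nodes.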


I ran the command 'df -P -B 1 /mnt/pve/Backup-VZDump' manually. No problem. It's an NFS mount.

Found this in dmesg:

INFO: task lzop:990183 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lzop D ffff88010d6fc680 0 990183 990179 0 0x00000000
ffff8801473ff9c8 0000000000000082 ffff8801473ff990 0000000000000002
0000000000000003 00000000000003c8 ffff880028250610 ffff880200000000
ffff880028250600 000000012830c448 ffff88010d6fcc48 000000000001e9c0
Call Trace:
[<ffffffff81123d10>] ? sync_page+0x0/0x50
[<ffffffff8151cd93>] io_schedule+0x73/0xc0
[<ffffffff81123d4d>] sync_page+0x3d/0x50
[<ffffffff8151d75f>] __wait_on_bit+0x5f/0x90
[<ffffffff81123f83>] wait_on_page_bit+0x73/0x80
[<ffffffff81096c00>] ? wake_bit_function+0x0/0x40
[<ffffffff81126280>] grab_cache_page_write_begin+0xc0/0x150
[<ffffffffa0652477>] nfs_write_begin+0x77/0x230 [nfs]
[<ffffffff811248bb>] generic_file_buffered_write_iter+0x10b/0x2b0
[<ffffffff81126702>] __generic_file_write_iter+0x1a2/0x3b0
[<ffffffff81126995>] __generic_file_aio_write+0x85/0xa0
[<ffffffff81126a38>] generic_file_aio_write+0x88/0x100
[<ffffffffa06531fc>] nfs_file_write+0x10c/0x210 [nfs]
[<ffffffffa06530f1>] ? nfs_file_write+0x1/0x210 [nfs]
[<ffffffff81197c9a>] do_sync_write+0xfa/0x140
[<ffffffff81096bc0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
[<ffffffff81197f78>] vfs_write+0xb8/0x1a0
[<ffffffff8100ba8e>] ? common_interrupt+0xe/0x13
[<ffffffff81198871>] sys_write+0x51/0x90
[<ffffffff8100b102>] system_call_fastpath+0x16/0x1b


Hi, corosync uses multicast and NFS uses unicast, so I'm not sure it's the same problem.
But maybe something is wrong with your NICs (drivers) or switches.
What is your NIC model?
 
Hi! I'm back.

Let me summarize.

The 2 servers are HP DL360G5 quad-core machines with Broadcom NetXtreme II BCM5708 and Intel 82571EB NICs. I don't know whether the network is connected to the Intel or the Broadcom interface. They are connected to an Asanté FriendlyNET GX5-2400 switch. These servers ran PVE 1.9 for years without problems, and last week I reinstalled them with PVE 2.3 (full install, not an upgrade). The "NAS" hosting the NFS mount is also a DL360G5.

Node 1 contains 6 CTs; node 2 only one.

I don't see errors on the NIC interfaces (node 1, node 2 and NAS). The connection seems OK. VM backup to the NFS mount works fine.

So far, the only fix I have found is restarting pve-cluster on both PVE nodes to get the system back up and running.

The system runs fine all day, but in the morning I discover that the cluster is no longer running.

I'm lost on this problem.
 
OK, so you don't use Cisco switches after all?

Do you have a dedicated card for storage and a dedicated card for LAN/Proxmox communication?

Do you use NFS v3 or v4?

Also, I've been reading your switch documentation:

http://www.asante.com/downloads/userManuals/GX6-2400W_UM.pdf

page 32

Try disabling "igmp snooping enable" and enabling "igmp querying enable".

This should help with the multicast problem.

(But I'm not sure it's related to your storage problem.)
 
About your NFS problem:

Do you use separate NICs for storage and for LAN/Proxmox communication?
Do you use NFS v3 or v4?
I personally prefer to use an Intel card for storage; the drivers are less buggy than Intel
 
Hi Spririt!

Thanks for the quick answer.

Sorry, the GX5-2400 is without a "W" at the end. This model has no management interface.

I use NFSv3 and have only one active NIC for all services. I plan to replace the old NAS (a DL360G5) with a new one, but that's not done yet. It is slow, and sometimes I get timeouts while it is writing.

I'm currently running a backup of all CTs on node 1 to check whether the backup crashes the cluster. I changed the schedule to run now.
 
Me again.

Backup of all CTs successful. Cluster still running fine. The backup is not related to the cluster crash. In fact, the backup was running at 7:30 am and the cluster seems to crash during the night.
 
Hello!

I'm still trying to figure out the problem.

I ran some tests.

First test:

I was running an ICMP ping to the NAS server from the PVE nodes in one shell window. Time is around 0.1.

But at the same time, in another shell, I can see many of these on both PVE nodes: "pvestatd[232274]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout"

Second test:

The cluster crashed just while I was connected to one of the PVE nodes. I got this in the log:

[TOTEM ] A processor failed, forming new configuration.

Just as the crash started, the shell stopped responding or was very slow to print output. After a few seconds, the shell went back to normal, but the cluster had crashed. I had to restart pve-cluster. The GUI was responsive; I could see the syslog filling the screen.

Both PVE servers are close to idle, with a very low load.

To sum up, I get plenty of "df failed" warnings, and the cluster crashes when communication seems to be lost between the 2 nodes, even though ping looks good.

If anyone has ideas or further tests to try, I would sincerely appreciate it.
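
The first test shows that a good ICMP ping does not prove the NFS mount is answering: `df` has to query the filesystem itself, and that is the call that times out. A minimal Python sketch that times that query with a deadline is below. The helper name `probe_mount` and the 2-second default are my own assumptions; it is not what pvestatd runs internally, only a way to reproduce the symptom by hand.

```python
import os
import threading
import time


def probe_mount(path, timeout=2.0):
    """Call statvfs(path) in a helper thread and return (status, seconds).

    status is "ok", "timeout" or "error". statvfs is the kind of query
    df relies on, so a stalled NFS server would be expected to show up
    here as "timeout".
    """
    outcome = {}

    def worker():
        try:
            os.statvfs(path)
            outcome["status"] = "ok"
        except OSError:
            outcome["status"] = "error"

    start = time.monotonic()
    # Daemon thread: a call stuck on a dead mount will not block interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    elapsed = time.monotonic() - start
    if thread.is_alive():
        return "timeout", elapsed
    return outcome["status"], elapsed
```

Calling `probe_mount("/mnt/pve/Backup-VZDump")` in a loop next to the ping would show whether the stalls line up with the pvestatd warnings.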
 
Hello!

This is just to say that the cluster engine has not crashed since the last kernel update.

I still get some df failures, but these are due to my old NAS device.

Thanks.
 
Hello!

The problem is still here. For the second time in 2 days, I found the cluster crashed this morning with:

Mar 22 08:39:03 aweb-vs003 pmxcfs[1325]: [status] crit: cpg_send_message failed: 12

I replaced the gigabit switch yesterday. I now use an HP 1810-24G v2. This node uses a BCM5708 card. The second node uses an e1000e card. I also pointed the NFS share for backups at a new, fast NAS device (NFSv4).

If I restart pve-cluster, pvedaemon and pvestatd, the cluster comes back fine.

I personally prefer to use an Intel card for storage; the drivers are less buggy than Intel

Sorry, but I don't understand (Intel/Intel).

 
Hello folks!

Does anyone know what these TOTEM lines in syslog mean?


Mar 22 16:24:58 aweb-vs004 pvestatd[1829]: WARNING: command 'df -P -B 1 /mnt/pve/Backup-VZDump' failed: got timeout
Mar 22 16:25:01 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f829 f82a f82b
Mar 22 16:25:01 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f829 f82a
Mar 22 16:25:16 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f847 f848
Mar 22 16:25:16 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f848
Mar 22 16:25:21 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f848
Mar 22 16:28:06 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: f970 f971 f972 f973
Mar 22 16:30:46 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fa8f fa90 fa91 fa92 fa93 fa94
Mar 22 16:30:46 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fa92 fa93 fa97 fa98 fa99
Mar 22 16:30:46 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fa98
Mar 22 16:31:09 aweb-vs004 pvestatd[1829]: WARNING: storage 'Backup-VZDump' is not online
Mar 22 16:31:10 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: faa9 faaa faab faac
Mar 22 16:31:10 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fab9 faba
Mar 22 16:31:18 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fac5 fac6 fac7 fac8 fac9
Mar 22 16:31:18 aweb-vs004 corosync[1486]: [TOTEM ] Retransmit List: fac5 fac6 fac7 fac8
Mar 22 16:33:08 aweb-vs004 pvedaemon[388459]: <root@pam> successful auth for user 'root@pam'
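
As far as I understand corosync's totem protocol, each entry in a "Retransmit List" line is the hexadecimal sequence number of a message this node has not received and is asking to have sent again. If that reading is right, numbers that keep reappearing across lines point to packets that are repeatedly lost. A minimal Python sketch for counting how often each sequence number is re-requested follows; the function name `retransmit_counts` and the regular expression are my own, based only on the lines quoted above.

```python
import re
from collections import Counter

# Captures the space-separated hex sequence numbers after 'Retransmit List:'.
RETRANSMIT = re.compile(r"\[TOTEM \] Retransmit List: ([0-9a-f ]+)")


def retransmit_counts(lines):
    """Return a Counter mapping sequence number (int) -> times it was requested."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.search(line)
        if match:
            for token in match.group(1).split():
                counts[int(token, 16)] += 1
    return counts
```

Feeding it the syslog and looking at `counts.most_common(10)` gives a rough picture of how persistent the loss is.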
 
