Hi,
Yesterday I had a strange issue on a Proxmox cluster with 4 nodes. I will try to describe it as best as I can.
First of all, all nodes run an up-to-date Proxmox 4.4:
proxmox-ve: 4.4-92 (running kernel: 4.4.67-1-pve)
pve-manager: 4.4-15 (running version: 4.4-15/7599e35a)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-52
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.24-1-pve: 4.4.24-72
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.62-1-pve: 4.4.62-88
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
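For completeness, the list above is the output of:

pveversion -v

and it should be identical on the other three nodes, since they were all updated at the same time.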
The nodes use one NAS to store ISOs and VM images, and another NAS to store snapshots. The nodes access both NAS boxes through NFS.
There are approximately 10 VMs in the cluster.
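For context, the two NFS storages are defined in /etc/pve/storage.cfg more or less like this (the storage names, server addresses and export paths below are made up for the example, only the layout matches our setup):

nfs: nas-images
        server 192.168.0.10
        export /export/pve-images
        path /mnt/pve/nas-images
        content iso,images
        options vers=3

nfs: nas-backup
        server 192.168.0.11
        export /export/pve-backup
        path /mnt/pve/nas-backup
        content backup
        maxfiles 3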
So, when I logged in to the web interface of the first node (node1), I noticed it was not able to see the other nodes (all nodes were shown as red in the web interface). I noticed the same issue on node2 and node3. node4 was able to see everybody. All VMs on each node seemed to work properly.
When I connected to node1, node2 and node3, the nodes were not able to ping each other.
When I checked the cluster status on each node, it reported that the cluster had 4 nodes, all available and healthy.
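For what it's worth, the checks I did on each node were more or less the following (typed from memory, so maybe not word for word what I actually ran):

ping node2
ping node3
pvecm status
pvecm nodes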
I started to get a little stressed, so I began shutting down the VMs on node1 in order to reboot the node. I was able to stop all VMs on this node without issue.
But I was not able to reboot the node: none of the Proxmox services/daemons could be stopped or killed by systemd.
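As far as I remember, the VM shutdowns were done from the web GUI (the qmshutdown tasks in the log below); on the CLI the equivalent would have been something like:

qm list
qm shutdown 105 --timeout 120

and when the reboot hung, what I tried was roughly along these lines (again from memory):

systemctl stop pvestatd pvedaemon pveproxy pve-cluster
reboot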
Here is a log of what happened when I began to stop the VMs and restart. There is nothing strange before that point.
Aug 8 17:51:55 node1 pvedaemon[29132]: <root@pam> starting task UPID:node1:00000FE3:14C3795E:5989DE1B:vncproxy:105:root@pam:
Aug 8 17:51:59 node1 pvedaemon[29130]: <root@pam> starting task UPID:node1:00000FEA:14C37AED:5989DE1F:qmshutdown:105:root@pam:
Aug 8 17:52:05 node1 pvedaemon[29132]: <root@pam> end task UPID:node1:00000FE3:14C3795E:5989DE1B:vncproxy:105:root@pam: OK
Aug 8 17:52:29 node1 kernel: [3483273.136640] vmbr0: port 6(tap105i0) entered disabled state
Aug 8 17:52:30 node1 pvedaemon[29130]: <root@pam> end task UPID:node1:00000FEA:14C37AED:5989DE1F:qmshutdown:105:root@pam: OK
Aug 8 17:54:47 node1 kernel: [3483410.601064] INFO: task qm:4068 blocked for more than 120 seconds.
Aug 8 17:54:47 node1 kernel: [3483410.601084] Tainted: P IO 4.4.67-1-pve #1
Aug 8 17:54:47 node1 kernel: [3483410.601097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 17:54:47 node1 kernel: [3483410.601131] qm D ffff88066374fdf8 0 4068 1 0x00000000
Aug 8 17:54:47 node1 kernel: [3483410.601135] ffff88066374fdf8 ffff88066080efc0 ffff880c63e88cc0 ffff880c603dbfc0
Aug 8 17:54:47 node1 kernel: [3483410.601137] ffff880663750000 ffff880c504683ac ffff880c603dbfc0 00000000ffffffff
Aug 8 17:54:47 node1 kernel: [3483410.601139] ffff880c504683b0 ffff88066374fe10 ffffffff818605d5 ffff880c504683a8
Aug 8 17:54:47 node1 kernel: [3483410.601141] Call Trace:
Aug 8 17:54:47 node1 kernel: [3483410.601149] [<ffffffff818605d5>] schedule+0x35/0x80
Aug 8 17:54:47 node1 kernel: [3483410.601151] [<ffffffff8186088e>] schedule_preempt_disabled+0xe/0x10
Aug 8 17:54:47 node1 kernel: [3483410.601153] [<ffffffff81862589>] __mutex_lock_slowpath+0xb9/0x130
Aug 8 17:54:47 node1 kernel: [3483410.601155] [<ffffffff8186261f>] mutex_lock+0x1f/0x30
Aug 8 17:54:47 node1 kernel: [3483410.601158] [<ffffffff8121fd0a>] filename_create+0x7a/0x160
Aug 8 17:54:47 node1 kernel: [3483410.601160] [<ffffffff81220ca3>] SyS_mkdir+0x53/0x100
Aug 8 17:54:47 node1 kernel: [3483410.601162] [<ffffffff818646f6>] entry_SYSCALL_64_fastpath+0x16/0x75
Aug 8 17:56:47 node1 kernel: [3483530.590225] INFO: task qm:4068 blocked for more than 120 seconds.
Aug 8 17:56:47 node1 kernel: [3483530.590259] Tainted: P IO 4.4.67-1-pve #1
Aug 8 17:56:47 node1 kernel: [3483530.590285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 17:56:47 node1 kernel: [3483530.590333] qm D ffff88066374fdf8 0 4068 1 0x00000000
Aug 8 17:56:47 node1 kernel: [3483530.590337] ffff88066374fdf8 ffff88066080efc0 ffff880c63e88cc0 ffff880c603dbfc0
Aug 8 17:56:47 node1 kernel: [3483530.590339] ffff880663750000 ffff880c504683ac ffff880c603dbfc0 00000000ffffffff
Aug 8 17:56:47 node1 kernel: [3483530.590341] ffff880c504683b0 ffff88066374fe10 ffffffff818605d5 ffff880c504683a8
Aug 8 17:56:47 node1 kernel: [3483530.590343] Call Trace:
Aug 8 17:56:47 node1 kernel: [3483530.590351] [<ffffffff818605d5>] schedule+0x35/0x80
Aug 8 17:56:47 node1 kernel: [3483530.590353] [<ffffffff8186088e>] schedule_preempt_disabled+0xe/0x10
Aug 8 17:56:47 node1 kernel: [3483530.590355] [<ffffffff81862589>] __mutex_lock_slowpath+0xb9/0x130
Aug 8 17:56:47 node1 kernel: [3483530.590357] [<ffffffff8186261f>] mutex_lock+0x1f/0x30
Aug 8 17:56:47 node1 kernel: [3483530.590360] [<ffffffff8121fd0a>] filename_create+0x7a/0x160
Aug 8 17:56:47 node1 kernel: [3483530.590362] [<ffffffff81220ca3>] SyS_mkdir+0x53/0x100
Aug 8 17:56:47 node1 kernel: [3483530.590364] [<ffffffff818646f6>] entry_SYSCALL_64_fastpath+0x16/0x75
...
Aug 8 18:06:47 node1 kernel: [3484130.536220] INFO: task vzdump:4922 blocked for more than 120 seconds.
Aug 8 18:06:47 node1 kernel: [3484130.536248] Tainted: P IO 4.4.67-1-pve #1
Aug 8 18:06:47 node1 kernel: [3484130.536274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 18:06:47 node1 kernel: [3484130.536318] vzdump D ffff880c415efdf8 0 4922 1 0x00000004
Aug 8 18:06:47 node1 kernel: [3484130.536320] ffff880c415efdf8 ffff88066080efc0 ffffffff81e12540 ffff880c5fc23fc0
Aug 8 18:06:47 node1 kernel: [3484130.536322] ffff880c415f0000 ffff880c504683ac ffff880c5fc23fc0 00000000ffffffff
Aug 8 18:06:47 node1 kernel: [3484130.536324] ffff880c504683b0 ffff880c415efe10 ffffffff818605d5 ffff880c504683a8
Aug 8 18:06:47 node1 kernel: [3484130.536326] Call Trace:
Aug 8 18:06:47 node1 kernel: [3484130.536328] [<ffffffff818605d5>] schedule+0x35/0x80
Aug 8 18:06:47 node1 kernel: [3484130.536330] [<ffffffff8186088e>] schedule_preempt_disabled+0xe/0x10
Aug 8 18:06:47 node1 kernel: [3484130.536331] [<ffffffff81862589>] __mutex_lock_slowpath+0xb9/0x130
Aug 8 18:06:47 node1 kernel: [3484130.536333] [<ffffffff8186261f>] mutex_lock+0x1f/0x30
Aug 8 18:06:47 node1 kernel: [3484130.536335] [<ffffffff8121fd0a>] filename_create+0x7a/0x160
Aug 8 18:06:47 node1 kernel: [3484130.536337] [<ffffffff81220ca3>] SyS_mkdir+0x53/0x100
Aug 8 18:06:47 node1 kernel: [3484130.536339] [<ffffffff818646f6>] entry_SYSCALL_64_fastpath+0x16/0x75
(these messages repeat again and again)
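If it helps for next time: I suppose I could have listed the processes stuck in uninterruptible sleep and dumped their kernel stacks before pulling the plug, with something like:

ps -eo pid,stat,comm,wchan:32 | awk '$2 ~ /D/'
echo w > /proc/sysrq-trigger    # the blocked-task stacks then show up in dmesg
nfsstat -m                      # state and mount options of the NFS storages

but I am not sure it would have changed the outcome.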
So I did a hard reboot of node1. Once it was shut down, node2 and node3 got back to normal. When node1 came back up, it worked properly too (even though it restarted one VM by itself for no apparent reason).
What happened? How can I prevent this from happening again? And what should have been done when it happened?