Multiple Cluster issues - online but nodes red

Mar 24, 2017
I have a cluster that has been running for a few months. We recently purchased enterprise licenses and began walking through the upgrade of all 10 servers. I was on server #9 when I ran into an issue following a reboot.

When the server rebooted, I noticed the cluster GUI (which is running on the last un-upgraded node) changed: all nodes except itself went from green to red. All VMs are running and I still have quorum, but I no longer have a VNC console connection for any VM. Logging into the GUI of any of the other nodes shows that node as green and gives me stats, but the other nodes are considered offline.

Ultimately, I'm looking for advice on how to correct this situation without a full restart of all 10 nodes. And if I do have to do a full restart, what should I watch out for when bringing the cluster back up?

Thanks for any assistance.

I have tried to restart pvestatd, but it fails with the following kernel messages:


Mar 23 15:59:24 prox1 systemd[1]: Stopping PVE Status Daemon...
Mar 23 16:00:54 prox1 systemd[1]: pvestatd.service stopping timed out. Terminating.
Mar 23 16:00:54 prox1 pvestatd[2073]: received signal TERM
Mar 23 16:00:54 prox1 pvestatd[2073]: server closing
Mar 23 16:00:54 prox1 pvestatd[2073]: server stopped
Mar 23 16:01:28 prox1 kernel: [1330092.292096] INFO: task pvestatd:110868 blocked for more than 120 seconds.
Mar 23 16:01:28 prox1 kernel: [1330092.292117] Not tainted 4.4.35-1-pve #1
Mar 23 16:01:28 prox1 kernel: [1330092.292131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 23 16:01:28 prox1 kernel: [1330092.292154] pvestatd D ffff8807e4f6bdf8 0 110868 1 0x00000004
Mar 23 16:01:28 prox1 kernel: [1330092.292158] ffff8807e4f6bdf8 ffff88300b4c1a40 ffff88301170e040 ffff88148bbeee00
Mar 23 16:01:28 prox1 kernel: [1330092.292160] ffff8807e4f6c000 ffff881806ca03ac ffff88148bbeee00 00000000ffffffff
Mar 23 16:01:28 prox1 kernel: [1330092.292161] ffff881806ca03b0 ffff8807e4f6be10 ffffffff81856155 ffff881806ca03a8
Mar 23 16:01:28 prox1 kernel: [1330092.292164] Call Trace:
Mar 23 16:01:28 prox1 kernel: [1330092.292170] [<ffffffff81856155>] schedule+0x35/0x80
Mar 23 16:01:28 prox1 kernel: [1330092.292172] [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Mar 23 16:01:28 prox1 kernel: [1330092.292174] [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Mar 23 16:01:28 prox1 kernel: [1330092.292176] [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Mar 23 16:01:28 prox1 kernel: [1330092.292181] [<ffffffff8121e19a>] filename_create+0x7a/0x160
Mar 23 16:01:28 prox1 kernel: [1330092.292183] [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
Mar 23 16:01:28 prox1 kernel: [1330092.292186] [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 23 16:02:24 prox1 systemd[1]: pvestatd.service stop-sigterm timed out. Killing.
Mar 23 16:03:28 prox1 kernel: [1330212.303588] INFO: task pvestatd:110868 blocked for more than 120 seconds.
Mar 23 16:03:28 prox1 kernel: [1330212.303609] Not tainted 4.4.35-1-pve #1
Mar 23 16:03:28 prox1 kernel: [1330212.303622] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 23 16:03:28 prox1 kernel: [1330212.303646] pvestatd D ffff8807e4f6bdf8 0 110868 1 0x00000004
Mar 23 16:03:28 prox1 kernel: [1330212.303649] ffff8807e4f6bdf8 ffff88300b4c1a40 ffff88301170e040 ffff88148bbeee00
Mar 23 16:03:28 prox1 kernel: [1330212.303651] ffff8807e4f6c000 ffff881806ca03ac ffff88148bbeee00 00000000ffffffff
Mar 23 16:03:28 prox1 kernel: [1330212.303653] ffff881806ca03b0 ffff8807e4f6be10 ffffffff81856155 ffff881806ca03a8
Mar 23 16:03:28 prox1 kernel: [1330212.303654] Call Trace:
Mar 23 16:03:28 prox1 kernel: [1330212.303661] [<ffffffff81856155>] schedule+0x35/0x80
Mar 23 16:03:28 prox1 kernel: [1330212.303663] [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Mar 23 16:03:28 prox1 kernel: [1330212.303664] [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Mar 23 16:03:28 prox1 kernel: [1330212.303666] [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Mar 23 16:03:28 prox1 kernel: [1330212.303670] [<ffffffff8121e19a>] filename_create+0x7a/0x160
Mar 23 16:03:28 prox1 kernel: [1330212.303672] [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
Mar 23 16:03:28 prox1 kernel: [1330212.303675] [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 23 16:03:55 prox1 systemd[1]: pvestatd.service still around after SIGKILL. Ignoring.

There are also stray systemd jobs still lurking about, trying to start and stop the service.
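For what it's worth, the backtraces above all block in filename_create (a mkdir, presumably inside /etc/pve), and tasks blocked like that sit in uninterruptible "D" state, which even SIGKILL cannot clear; hence the "still around after SIGKILL" messages. A sketch of how to spot such tasks (the live command is in the first comment; demonstrated here against a captured sample so the output is reproducible):

```shell
# On a live node: ps -eo pid,stat,wchan,comm | awk 'NR==1 || $2 ~ /^D/'
# Demonstrated on a sample in the same format; the "filena" wchan
# matches the filename_create frames in the backtraces above.
cat > ps.sample <<'EOF'
  PID STAT WCHAN  COMMAND
 2122 Ds   filena vzdump
 2650 Ds   filena pvestatd
 3001 Ss   poll_s sshd
EOF
# Keep the header row and any process whose state starts with "D".
awk 'NR==1 || $2 ~ /^D/' ps.sample
```

Anything listed there cannot be killed from userspace; it will only go away when the blocking resource (here, the pmxcfs mutex) is released or the node is rebooted.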

Here is other pertinent output related to the installation:

# from the un-upgraded node 1:
# cat .members
{
"nodename": "prox1",
"version": 124,
"cluster": { "name": "xxxxx", "version": 10, "nodes": 10, "quorate": 1 },
"nodelist": {
"prox2": { "id": 2, "online": 1, "ip": "10.20.1.120"},
"prox1": { "id": 1, "online": 1, "ip": "10.20.1.98"},
"prox8": { "id": 4, "online": 1, "ip": "10.20.1.92"},
"prox9": { "id": 5, "online": 1, "ip": "10.20.1.94"},
"prox10": { "id": 6, "online": 1, "ip": "10.20.1.96"},
"prox3": { "id": 7, "online": 1, "ip": "10.20.3.90"},
"prox4": { "id": 8, "online": 1, "ip": "10.20.3.92"},
"prox7": { "id": 3, "online": 1, "ip": "10.20.1.90"},
"prox6": { "id": 10, "online": 1, "ip": "10.20.3.96"},
"prox5": { "id": 9, "online": 1, "ip": "10.20.3.94"}
}
}
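The .members file can also be checked mechanically for nodes that pmxcfs considers down; a small sketch run against a sample in the same nodelist format (on a live node you would grep /etc/pve/.members directly; one entry is flipped to offline here purely for demonstration, since the listing above shows all ten online):

```shell
# Sample lines in the /etc/pve/.members nodelist format.
cat > members.sample <<'EOF'
"prox1": { "id": 1, "online": 1, "ip": "10.20.1.98"},
"prox2": { "id": 2, "online": 0, "ip": "10.20.1.120"},
EOF
# Print any node marked offline, or a confirmation if none are.
grep '"online": 0' members.sample || echo "all nodes online"
```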


# pvecm status
Quorum information
------------------
Date: Thu Mar 23 17:34:50 2017
Quorum provider: corosync_votequorum
Nodes: 10
Node ID: 0x00000001
Ring ID: 3/724
Quorate: Yes

Votequorum information
----------------------
Expected votes: 10
Highest expected: 10
Total votes: 10
Quorum: 6
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000003 1 10.20.1.90
0x00000004 1 10.20.1.92
0x00000005 1 10.20.1.94
0x00000006 1 10.20.1.96
0x00000001 1 10.20.1.98 (local)
0x00000002 1 10.20.1.120
0x00000007 1 10.20.3.90
0x00000008 1 10.20.3.92
0x00000009 1 10.20.3.94
0x0000000a 1 10.20.3.96


# pvesm status
local dir 1 997426160 829207300 117529428 88.09%
qnap nfs 1 16194897920 5106433024 11088464896 32.03%


# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
3 1 prox7
4 1 prox8
5 1 prox9
6 1 prox10
1 1 prox1 (local)
2 1 prox2
7 1 prox3
8 1 prox4
9 1 prox5
10 1 prox6

# /etc/hosts only shows entries for the local node and 1 other node

# authorized_keys shows valid keys for all nodes

# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-2 (running version: 4.4-2/80259e05)
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.4.21-1-pve: 4.4.21-71
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-84
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-89
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80

# pveversion from the other nodes:
root@prox3:~/.ssh# pveversion -v
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
 
Seems like pvestatd is hanging (badly) on that node. I see that it is not yet fully upgraded; could you try to do that first?
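For context, the mismatch referred to can be seen by diffing the two pveversion listings; a sketch using just two of the differing lines quoted above (on the real hosts you would diff the full `pveversion -v` output of each node):

```shell
# Two of the packages that differ between the listings above.
cat > prox1.ver <<'EOF'
proxmox-ve: 4.4-76
pve-manager: 4.4-2
EOF
cat > prox3.ver <<'EOF'
proxmox-ve: 4.4-82
pve-manager: 4.4-12
EOF
# diff exits non-zero when the files differ; that is expected here.
diff prox1.ver prox3.ver || true
# Bringing the lagging node level would normally be:
#   apt-get update && apt-get dist-upgrade
```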
 
It's not just that one node, or only that one service, in the cluster. During the daily daemon sweep cron, each node attempted to restart pveproxy. This failed with the syslog listed below.

I now have no access to the GUI on any node.

Mar 24 06:25:01 prox3 systemd[1]: Stopping PVE API Proxy Server...
Mar 24 06:26:31 prox3 systemd[1]: pveproxy.service stopping timed out. Terminating.
Mar 24 06:26:31 prox3 pveproxy[20566]: received signal TERM
Mar 24 06:26:31 prox3 pveproxy[20567]: worker exit
Mar 24 06:26:31 prox3 pveproxy[20566]: server closing
Mar 24 06:26:31 prox3 pveproxy[20566]: worker 20567 finished
Mar 24 06:26:31 prox3 pveproxy[20566]: worker 20569 finished
Mar 24 06:26:31 prox3 pveproxy[20566]: worker 20568 finished
Mar 24 06:26:31 prox3 pveproxy[20566]: server stopped
Mar 24 06:28:02 prox3 systemd[1]: pveproxy.service stop-sigterm timed out. Killing.
Mar 24 06:28:37 prox3 kernel: [727681.948952] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:28:37 prox3 kernel: [727681.948971] Not tainted 4.4.40-1-pve #1
Mar 24 06:28:37 prox3 kernel: [727681.948985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:28:37 prox3 kernel: [727681.949008] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:28:37 prox3 kernel: [727681.949012] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:28:37 prox3 kernel: [727681.949014] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:28:37 prox3 kernel: [727681.949015] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:28:37 prox3 kernel: [727681.949017] Call Trace:
Mar 24 06:28:37 prox3 kernel: [727681.949024] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:28:37 prox3 kernel: [727681.949026] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:28:37 prox3 kernel: [727681.949028] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:28:37 prox3 kernel: [727681.949030] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:28:37 prox3 kernel: [727681.949034] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:28:37 prox3 kernel: [727681.949036] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:28:37 prox3 kernel: [727681.949038] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:29:32 prox3 systemd[1]: pveproxy.service still around after SIGKILL. Ignoring.
Mar 24 06:30:37 prox3 kernel: [727801.949251] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:30:37 prox3 kernel: [727801.949271] Not tainted 4.4.40-1-pve #1
Mar 24 06:30:37 prox3 kernel: [727801.949285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:30:37 prox3 kernel: [727801.949321] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:30:37 prox3 kernel: [727801.949326] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:30:37 prox3 kernel: [727801.949329] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:30:37 prox3 kernel: [727801.949330] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:30:37 prox3 kernel: [727801.949334] Call Trace:
Mar 24 06:30:37 prox3 kernel: [727801.949340] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:30:37 prox3 kernel: [727801.949342] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:30:37 prox3 kernel: [727801.949344] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:30:37 prox3 kernel: [727801.949346] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:30:37 prox3 kernel: [727801.949351] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:30:37 prox3 kernel: [727801.949354] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:30:37 prox3 kernel: [727801.949358] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:31:02 prox3 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Mar 24 06:32:32 prox3 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Mar 24 06:32:32 prox3 systemd[1]: Unit pveproxy.service entered failed state.
Mar 24 06:32:32 prox3 systemd[1]: Starting PVE API Proxy Server...
Mar 24 06:32:37 prox3 kernel: [727921.949591] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:32:37 prox3 kernel: [727921.949610] Not tainted 4.4.40-1-pve #1
Mar 24 06:32:37 prox3 kernel: [727921.949624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:32:37 prox3 kernel: [727921.949647] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:32:37 prox3 kernel: [727921.949650] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:32:37 prox3 kernel: [727921.949652] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:32:37 prox3 kernel: [727921.949654] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:32:37 prox3 kernel: [727921.949656] Call Trace:
Mar 24 06:32:37 prox3 kernel: [727921.949662] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:32:37 prox3 kernel: [727921.949664] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:32:37 prox3 kernel: [727921.949666] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:32:37 prox3 kernel: [727921.949668] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:32:37 prox3 kernel: [727921.949673] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:32:37 prox3 kernel: [727921.949674] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:32:37 prox3 kernel: [727921.949677] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:34:03 prox3 systemd[1]: pveproxy.service start operation timed out. Terminating.
Mar 24 06:34:37 prox3 kernel: [728041.949900] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:34:37 prox3 kernel: [728041.949946] Not tainted 4.4.40-1-pve #1
Mar 24 06:34:37 prox3 kernel: [728041.949982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:34:37 prox3 kernel: [728041.950051] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:34:37 prox3 kernel: [728041.950054] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:34:37 prox3 kernel: [728041.950056] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:34:37 prox3 kernel: [728041.950058] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:34:37 prox3 kernel: [728041.950060] Call Trace:
Mar 24 06:34:37 prox3 kernel: [728041.950066] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:34:37 prox3 kernel: [728041.950068] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:34:37 prox3 kernel: [728041.950070] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:34:37 prox3 kernel: [728041.950072] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:34:37 prox3 kernel: [728041.950076] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:34:37 prox3 kernel: [728041.950078] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:34:37 prox3 kernel: [728041.950080] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:34:37 prox3 kernel: [728041.950082] INFO: task pveproxy:109715 blocked for more than 120 seconds.
Mar 24 06:34:37 prox3 kernel: [728041.950125] Not tainted 4.4.40-1-pve #1
Mar 24 06:34:37 prox3 kernel: [728041.950161] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:34:37 prox3 kernel: [728041.950226] pveproxy D ffff8842b1757df8 0 109715 1 0x00000004
Mar 24 06:34:37 prox3 kernel: [728041.950228] ffff8842b1757df8 ffff8847cfc023c0 ffff8847d5b4f000 ffff881178875400
Mar 24 06:34:37 prox3 kernel: [728041.950230] ffff8842b1758000 ffff8823c37483ac ffff881178875400 00000000ffffffff
Mar 24 06:34:37 prox3 kernel: [728041.950231] ffff8823c37483b0 ffff8842b1757e10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:34:37 prox3 kernel: [728041.950233] Call Trace:
Mar 24 06:34:37 prox3 kernel: [728041.950234] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:34:37 prox3 kernel: [728041.950236] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:34:37 prox3 kernel: [728041.950237] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:34:37 prox3 kernel: [728041.950239] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:34:37 prox3 kernel: [728041.950241] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:34:37 prox3 kernel: [728041.950242] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:34:37 prox3 kernel: [728041.950244] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:35:33 prox3 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Mar 24 06:36:37 prox3 kernel: [728161.950213] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:36:37 prox3 kernel: [728161.950259] Not tainted 4.4.40-1-pve #1
Mar 24 06:36:37 prox3 kernel: [728161.950296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:36:37 prox3 kernel: [728161.950362] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:36:37 prox3 kernel: [728161.950373] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:36:37 prox3 kernel: [728161.950375] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:36:37 prox3 kernel: [728161.950377] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:36:37 prox3 kernel: [728161.950379] Call Trace:
Mar 24 06:36:37 prox3 kernel: [728161.950385] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:36:37 prox3 kernel: [728161.950387] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:36:37 prox3 kernel: [728161.950389] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:36:37 prox3 kernel: [728161.950391] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:36:37 prox3 kernel: [728161.950395] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:36:37 prox3 kernel: [728161.950397] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:36:37 prox3 kernel: [728161.950399] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:36:37 prox3 kernel: [728161.950401] INFO: task pveproxy:109715 blocked for more than 120 seconds.
Mar 24 06:36:37 prox3 kernel: [728161.950443] Not tainted 4.4.40-1-pve #1
Mar 24 06:36:37 prox3 kernel: [728161.950479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:36:37 prox3 kernel: [728161.950545] pveproxy D ffff8842b1757df8 0 109715 1 0x00000004
Mar 24 06:36:37 prox3 kernel: [728161.950546] ffff8842b1757df8 ffff8847cfc023c0 ffff8847d5b4f000 ffff881178875400
Mar 24 06:36:37 prox3 kernel: [728161.950548] ffff8842b1758000 ffff8823c37483ac ffff881178875400 00000000ffffffff
Mar 24 06:36:37 prox3 kernel: [728161.950550] ffff8823c37483b0 ffff8842b1757e10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:36:37 prox3 kernel: [728161.950551] Call Trace:
Mar 24 06:36:37 prox3 kernel: [728161.950553] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:36:37 prox3 kernel: [728161.950554] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:36:37 prox3 kernel: [728161.950556] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:36:37 prox3 kernel: [728161.950557] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:36:37 prox3 kernel: [728161.950559] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:36:37 prox3 kernel: [728161.950561] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:36:37 prox3 kernel: [728161.950562] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:37:03 prox3 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Mar 24 06:37:03 prox3 systemd[1]: Failed to start PVE API Proxy Server.
Mar 24 06:37:03 prox3 systemd[1]: Unit pveproxy.service entered failed state.
Mar 24 06:37:03 prox3 systemd[1]: Stopping PVE SPICE Proxy Server...
Mar 24 06:38:33 prox3 systemd[1]: spiceproxy.service stopping timed out. Terminating.
Mar 24 06:38:33 prox3 spiceproxy[20593]: worker exit
Mar 24 06:38:33 prox3 spiceproxy[20592]: received signal TERM
Mar 24 06:38:33 prox3 spiceproxy[20592]: server closing
Mar 24 06:38:33 prox3 spiceproxy[20592]: worker 20593 finished
Mar 24 06:38:33 prox3 spiceproxy[20592]: server stopped
Mar 24 06:38:37 prox3 kernel: [728281.950545] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:38:37 prox3 kernel: [728281.950592] Not tainted 4.4.40-1-pve #1
Mar 24 06:38:37 prox3 kernel: [728281.950629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:38:37 prox3 kernel: [728281.950695] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:38:37 prox3 kernel: [728281.950698] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:38:37 prox3 kernel: [728281.950700] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:38:37 prox3 kernel: [728281.950702] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:38:37 prox3 kernel: [728281.950704] Call Trace:
Mar 24 06:38:37 prox3 kernel: [728281.950709] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:38:37 prox3 kernel: [728281.950712] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:38:37 prox3 kernel: [728281.950713] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:38:37 prox3 kernel: [728281.950715] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:38:37 prox3 kernel: [728281.950718] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:38:37 prox3 kernel: [728281.950720] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:38:37 prox3 kernel: [728281.950722] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:38:37 prox3 kernel: [728281.950724] INFO: task pveproxy:109715 blocked for more than 120 seconds.
Mar 24 06:38:37 prox3 kernel: [728281.950766] Not tainted 4.4.40-1-pve #1
Mar 24 06:38:37 prox3 kernel: [728281.950802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:38:37 prox3 kernel: [728281.950868] pveproxy D ffff8842b1757df8 0 109715 1 0x00000004
Mar 24 06:38:37 prox3 kernel: [728281.950870] ffff8842b1757df8 ffff8847cfc023c0 ffff8847d5b4f000 ffff881178875400
Mar 24 06:38:37 prox3 kernel: [728281.950871] ffff8842b1758000 ffff8823c37483ac ffff881178875400 00000000ffffffff
Mar 24 06:38:37 prox3 kernel: [728281.950873] ffff8823c37483b0 ffff8842b1757e10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:38:37 prox3 kernel: [728281.950874] Call Trace:
Mar 24 06:38:37 prox3 kernel: [728281.950876] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:38:37 prox3 kernel: [728281.950877] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:38:37 prox3 kernel: [728281.950878] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:38:37 prox3 kernel: [728281.950880] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:38:37 prox3 kernel: [728281.950882] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:38:37 prox3 kernel: [728281.950883] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:38:37 prox3 kernel: [728281.950885] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:40:04 prox3 systemd[1]: spiceproxy.service stop-sigterm timed out. Killing.
Mar 24 06:40:37 prox3 kernel: [728401.950894] INFO: task pveproxy:109240 blocked for more than 120 seconds.
Mar 24 06:40:37 prox3 kernel: [728401.950940] Not tainted 4.4.40-1-pve #1
Mar 24 06:40:37 prox3 kernel: [728401.950976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 06:40:37 prox3 kernel: [728401.951046] pveproxy D ffff8841be81bdf8 0 109240 1 0x00000004
Mar 24 06:40:37 prox3 kernel: [728401.951050] ffff8841be81bdf8 ffff8847cfc023c0 ffff8847d5038000 ffff8823d046aa00
Mar 24 06:40:37 prox3 kernel: [728401.951052] ffff8841be81c000 ffff8823c37483ac ffff8823d046aa00 00000000ffffffff
Mar 24 06:40:37 prox3 kernel: [728401.951053] ffff8823c37483b0 ffff8841be81be10 ffffffff81858205 ffff8823c37483a8
Mar 24 06:40:37 prox3 kernel: [728401.951056] Call Trace:
Mar 24 06:40:37 prox3 kernel: [728401.951061] [<ffffffff81858205>] schedule+0x35/0x80
Mar 24 06:40:37 prox3 kernel: [728401.951064] [<ffffffff818584be>] schedule_preempt_disabled+0xe/0x10
Mar 24 06:40:37 prox3 kernel: [728401.951066] [<ffffffff8185a1b9>] __mutex_lock_slowpath+0xb9/0x130
Mar 24 06:40:37 prox3 kernel: [728401.951068] [<ffffffff8185a24f>] mutex_lock+0x1f/0x30
Mar 24 06:40:37 prox3 kernel: [728401.951072] [<ffffffff8121f2ba>] filename_create+0x7a/0x160
Mar 24 06:40:37 prox3 kernel: [728401.951074] [<ffffffff81220253>] SyS_mkdir+0x53/0x100
Mar 24 06:40:37 prox3 kernel: [728401.951076] [<ffffffff8185c336>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 24 06:41:34 prox3 systemd[1]: spiceproxy.service still around after SIGKILL. Ignoring.
Mar 24 06:43:04 prox3 systemd[1]: spiceproxy.service stop-final-sigterm timed out. Killing.
Mar 24 06:44:34 prox3 systemd[1]: spiceproxy.service still around after final SIGKILL. Entering failed mode.
Mar 24 06:44:34 prox3 systemd[1]: Unit spiceproxy.service entered failed state.
Mar 24 06:44:34 prox3 systemd[1]: Starting PVE API Proxy Server...
Mar 24 06:46:05 prox3 systemd[1]: pveproxy.service start operation timed out. Terminating.
Mar 24 06:47:35 prox3 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Mar 24 06:49:05 prox3 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Mar 24 06:49:05 prox3 systemd[1]: Failed to start PVE API Proxy Server.
Mar 24 06:49:05 prox3 systemd[1]: Unit pveproxy.service entered failed state.
Mar 24 06:49:05 prox3 systemd[1]: Starting PVE SPICE Proxy Server...
Mar 24 06:50:35 prox3 systemd[1]: spiceproxy.service start operation timed out. Terminating.
Mar 24 06:52:06 prox3 systemd[1]: spiceproxy.service stop-final-sigterm timed out. Killing.
Mar 24 06:53:36 prox3 systemd[1]: spiceproxy.service still around after final SIGKILL. Entering failed mode.
Mar 24 06:53:36 prox3 systemd[1]: Failed to start PVE SPICE Proxy Server.
Mar 24 06:53:36 prox3 systemd[1]: Unit spiceproxy.service entered failed state.
 
One other noted behavior: I have one node that I can restart at will without any issues. After a fresh reboot, if I stop pve-cluster and run pmxcfs -l (local mode), things work normally on that node. I can start and stop VMs without issue, and VNC works on the freshly started VM as well.

If I reboot with pve-cluster active, I get the following in daemon.log:

Code:
Mar 24 11:31:29 prox2 pvedaemon[2613]: start VM 700: UPID:prox2:00000A35:000063CF:58D56601:qmstart:700:root@pam:
Mar 24 11:31:29 prox2 systemd[1]: Starting qemu.slice.
Mar 24 11:31:29 prox2 systemd[1]: Created slice qemu.slice.
Mar 24 11:31:29 prox2 systemd[1]: Starting 700.scope.
Mar 24 11:31:29 prox2 systemd[1]: Started 700.scope.
Mar 24 11:31:29 prox2 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Mar 24 11:31:31 prox2 pvedaemon[2632]: starting vnc proxy UPID:prox2:00000A48:00006463:58D56603:vncproxy:700:root@pam:
Mar 24 11:31:59 prox2 pvedaemon[2613]: start failed: command '/usr/bin/kvm -id 700 -chardev 'socket,id=qmp,path=/var/run/qemu-server/700.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/700.pid -daemonize -smbios 'type=1,uuid=4990ec6f-713a-43b3-b8d4-74cc23303c59' -name iotest-3.240 -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga cirrus -vnc unix:/var/run/qemu-server/700.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 20480 -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:bed78da5b2e0' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/mnt/pve/qnap/images/700/vm-700-disk-1.raw,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -drive 'file=/mnt/pve/qnap/images/700/vm-700-disk-2.raw,if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1' -netdev 'type=tap,id=net0,ifname=tap700i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=FE:45:1B:42:7E:39,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: got timeout
 
I get the same errors. It worked well until I tried to add a fourth node to a working cluster.
All the nodes are fully updated; I ran apt-get dist-upgrade before joining them to the cluster.

The pvecm add <ip> hung after "creating certificates", and from there everything broke.
I cannot migrate VMs between nodes, and it seems that any command related to the cluster hangs in D state.

Code:
27708 ?        Ds     0:00 task UPID e710ll-02:00006C3C:00756B9B:58D7C943:qmigrate:106:root@pam:
27174 ?        Ds     0:00 task UPID e710ll-02:00006A26:00743987:58D7C633:qmigrate:800:root@pam:

I tried to reboot one node that has no VMs running. I see this in dmesg (I have copied only part of the text):

Code:
[266160.232082] INFO: task vzdump:2122 blocked for more than 120 seconds.
[266160.232187]       Tainted: P           O    4.4.35-1-pve #1
[266160.232278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[266160.232391] vzdump          D ffff8805b4e1bdf8     0  2122      1 0x00000004
[266160.232396]  ffff8805b4e1bdf8 ffff880666c50780 ffff880663fa8c80 ffff88065d477080
[266160.232399]  ffff8805b4e1c000 ffff880035a2ccac ffff88065d477080 00000000ffffffff
[266160.232402]  ffff880035a2ccb0 ffff8805b4e1be10 ffffffff81856155 ffff880035a2cca8
[266160.232404] Call Trace:
[266160.232413]  [<ffffffff81856155>] schedule+0x35/0x80
[266160.232416]  [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
[266160.232418]  [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
[266160.232420]  [<ffffffff8185819f>] mutex_lock+0x1f/0x30
[266160.232425]  [<ffffffff8121e19a>] filename_create+0x7a/0x160
[266160.232427]  [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
[266160.232430]  [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
[266280.232078] INFO: task vzdump:2122 blocked for more than 120 seconds.
[266280.232198]       Tainted: P           O    4.4.35-1-pve #1
[266280.232288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[266280.232401] vzdump          D ffff8805b4e1bdf8     0  2122      1 0x00000004
[266280.232406]  ffff8805b4e1bdf8 ffff880666c50780 ffff880663fa8c80 ffff88065d477080
[266280.232409]  ffff8805b4e1c000 ffff880035a2ccac ffff88065d477080 00000000ffffffff
[266280.232412]  ffff880035a2ccb0 ffff8805b4e1be10 ffffffff81856155 ffff880035a2cca8
[266280.232414] Call Trace:
[266280.232423]  [<ffffffff81856155>] schedule+0x35/0x80
[266280.232426]  [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
[266280.232428]  [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
[266280.232430]  [<ffffffff8185819f>] mutex_lock+0x1f/0x30
[266280.232435]  [<ffffffff8121e19a>] filename_create+0x7a/0x160
[266280.232437]  [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
[266280.232440]  [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
[266400.232078] INFO: task vzdump:2122 blocked for more than 120 seconds.
....

I have these processes in D state

Code:
1     0  2003     1  20   0 264276 74852 reques Ds   ?          0:32 pve-ha-lrm
0     0  2122     1  20   0 236320 64848 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/vzdump -stop
4     0  2645     1  20   0 238588 65032 filena Ds   ?          0:00 /usr/bin/perl /usr/sbin/pve-ha-lrm stop
4     0  2647     1  20   0 234968 61604 filena Ds   ?          0:00 /usr/bin/perl /usr/sbin/pve-firewall stop
4     0  2648     1  20   0 239200 65680 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/spiceproxy stop
4     0  2650     1  20   0 232520 59172 filena Ds   ?          0:00 /usr/bin/perl /usr/bin/pvestatd stop
Please help me. I have production VMs running on these servers and I cannot do anything with them at the moment; even the console does not work.
 
I spent several days trying to work out a solution, but the bottom line is that communication within pve-cluster was completely broken. As soon as a restarted node attempted to communicate with the others, it locked up along with them. This indicates to me that something had corrupted the pve-cluster (pmxcfs) process, and regardless of corosync quorum and config sharing, the nodes simply could not communicate. Unfortunately, this also affects the qm commands, stats, and the VNC servers.
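One way to confirm that it is the pmxcfs FUSE mount itself that is wedged (my assumption, based on the mkdir backtraces all blocking inside it) is to wrap an access in `timeout`, so the hang usually becomes a detectable failure instead of yet another D-state process:

```shell
# Any access to a hung FUSE mount tends to block; `timeout` bounds
# the wait so the check itself does not get stuck (in most cases).
if timeout 5 ls /etc/pve >/dev/null 2>&1; then
    result="pmxcfs responsive"
else
    result="pmxcfs hung or not mounted"
fi
echo "$result"
```

On a machine that is not a PVE node, /etc/pve simply does not exist, so the check reports the mount as hung or not mounted.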

As far as I know, there is no solution to this short of a full reboot of the cluster. That is what I did last night: I shut down all 70+ VMs and performed a forced shutdown on each node (poweroff). I then restarted each node one at a time, starting with the original ring0 leader.
The nodes came back up into the cluster without any issues, and I was able to relaunch all of the VMs.

Not a graceful solution, but it worked.
 
