I have a cluster that has been running for a few months. We recently purchased the enterprise licenses and began to walk through the upgrade of all 10 servers. I was on server #9 when I ran into an issue following a reboot.
When the server rebooted, I noticed the cluster GUI (which is running on the last un-upgraded node) changed and all nodes except itself went from green to red. All VMs are still running and I still have quorum, but I no longer have a VNC console connection for any VM. Logging into the GUI of any of the other nodes shows that node as green and gives me stats, but the rest of the nodes are shown as offline.
Ultimately, I'm looking for advice on how to correct this situation without a full restart of all 10 nodes. And if I do have to do a full restart, what should I watch out for when bringing the cluster back up?
Thanks for any assistance.
I have tried to restart pvestatd, but it fails; the commands I used and the resulting kernel messages are below.
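The restart attempt itself was just the standard systemctl calls (exact invocations typed from memory):
# systemctl restart pvestatd.service
# systemctl status pvestatd.service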
Mar 23 15:59:24 prox1 systemd[1]: Stopping PVE Status Daemon...
Mar 23 16:00:54 prox1 systemd[1]: pvestatd.service stopping timed out. Terminating.
Mar 23 16:00:54 prox1 pvestatd[2073]: received signal TERM
Mar 23 16:00:54 prox1 pvestatd[2073]: server closing
Mar 23 16:00:54 prox1 pvestatd[2073]: server stopped
Mar 23 16:01:28 prox1 kernel: [1330092.292096] INFO: task pvestatd:110868 blocked for more than 120 seconds.
Mar 23 16:01:28 prox1 kernel: [1330092.292117] Not tainted 4.4.35-1-pve #1
Mar 23 16:01:28 prox1 kernel: [1330092.292131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 23 16:01:28 prox1 kernel: [1330092.292154] pvestatd D ffff8807e4f6bdf8 0 110868 1 0x00000004
Mar 23 16:01:28 prox1 kernel: [1330092.292158] ffff8807e4f6bdf8 ffff88300b4c1a40 ffff88301170e040 ffff88148bbeee00
Mar 23 16:01:28 prox1 kernel: [1330092.292160] ffff8807e4f6c000 ffff881806ca03ac ffff88148bbeee00 00000000ffffffff
Mar 23 16:01:28 prox1 kernel: [1330092.292161] ffff881806ca03b0 ffff8807e4f6be10 ffffffff81856155 ffff881806ca03a8
Mar 23 16:01:28 prox1 kernel: [1330092.292164] Call Trace:
Mar 23 16:01:28 prox1 kernel: [1330092.292170] [<ffffffff81856155>] schedule+0x35/0x80
Mar 23 16:01:28 prox1 kernel: [1330092.292172] [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Mar 23 16:01:28 prox1 kernel: [1330092.292174] [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Mar 23 16:01:28 prox1 kernel: [1330092.292176] [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Mar 23 16:01:28 prox1 kernel: [1330092.292181] [<ffffffff8121e19a>] filename_create+0x7a/0x160
Mar 23 16:01:28 prox1 kernel: [1330092.292183] [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
Mar 23 16:01:28 prox1 kernel: [1330092.292186] [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 23 16:02:24 prox1 systemd[1]: pvestatd.service stop-sigterm timed out. Killing.
Mar 23 16:03:28 prox1 kernel: [1330212.303588] INFO: task pvestatd:110868 blocked for more than 120 seconds.
Mar 23 16:03:28 prox1 kernel: [1330212.303609] Not tainted 4.4.35-1-pve #1
Mar 23 16:03:28 prox1 kernel: [1330212.303622] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 23 16:03:28 prox1 kernel: [1330212.303646] pvestatd D ffff8807e4f6bdf8 0 110868 1 0x00000004
Mar 23 16:03:28 prox1 kernel: [1330212.303649] ffff8807e4f6bdf8 ffff88300b4c1a40 ffff88301170e040 ffff88148bbeee00
Mar 23 16:03:28 prox1 kernel: [1330212.303651] ffff8807e4f6c000 ffff881806ca03ac ffff88148bbeee00 00000000ffffffff
Mar 23 16:03:28 prox1 kernel: [1330212.303653] ffff881806ca03b0 ffff8807e4f6be10 ffffffff81856155 ffff881806ca03a8
Mar 23 16:03:28 prox1 kernel: [1330212.303654] Call Trace:
Mar 23 16:03:28 prox1 kernel: [1330212.303661] [<ffffffff81856155>] schedule+0x35/0x80
Mar 23 16:03:28 prox1 kernel: [1330212.303663] [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Mar 23 16:03:28 prox1 kernel: [1330212.303664] [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Mar 23 16:03:28 prox1 kernel: [1330212.303666] [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Mar 23 16:03:28 prox1 kernel: [1330212.303670] [<ffffffff8121e19a>] filename_create+0x7a/0x160
Mar 23 16:03:28 prox1 kernel: [1330212.303672] [<ffffffff8121f133>] SyS_mkdir+0x53/0x100
Mar 23 16:03:28 prox1 kernel: [1330212.303675] [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 23 16:03:55 prox1 systemd[1]: pvestatd.service still around after SIGKILL. Ignoring.
There are also still systemd jobs hanging around trying to start and stop the service; the commands I used to see them are below.
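Roughly how I'm seeing the leftover jobs and the stuck process (output omitted here):
# systemctl list-jobs
# ps aux | grep pvestatd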
Here is other pertinent output related to the installation:
# from the un-upgraded node, prox1:
# cat .members
{
"nodename": "prox1",
"version": 124,
"cluster": { "name": "xxxxx", "version": 10, "nodes": 10, "quorate": 1 },
"nodelist": {
"prox2": { "id": 2, "online": 1, "ip": "10.20.1.120"},
"prox1": { "id": 1, "online": 1, "ip": "10.20.1.98"},
"prox8": { "id": 4, "online": 1, "ip": "10.20.1.92"},
"prox9": { "id": 5, "online": 1, "ip": "10.20.1.94"},
"prox10": { "id": 6, "online": 1, "ip": "10.20.1.96"},
"prox3": { "id": 7, "online": 1, "ip": "10.20.3.90"},
"prox4": { "id": 8, "online": 1, "ip": "10.20.3.92"},
"prox7": { "id": 3, "online": 1, "ip": "10.20.1.90"},
"prox6": { "id": 10, "online": 1, "ip": "10.20.3.96"},
"prox5": { "id": 9, "online": 1, "ip": "10.20.3.94"}
}
}
# pvecm status
Quorum information
------------------
Date: Thu Mar 23 17:34:50 2017
Quorum provider: corosync_votequorum
Nodes: 10
Node ID: 0x00000001
Ring ID: 3/724
Quorate: Yes
Votequorum information
----------------------
Expected votes: 10
Highest expected: 10
Total votes: 10
Quorum: 6
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000003 1 10.20.1.90
0x00000004 1 10.20.1.92
0x00000005 1 10.20.1.94
0x00000006 1 10.20.1.96
0x00000001 1 10.20.1.98 (local)
0x00000002 1 10.20.1.120
0x00000007 1 10.20.3.90
0x00000008 1 10.20.3.92
0x00000009 1 10.20.3.94
0x0000000a 1 10.20.3.96
# pvesm status
local dir 1 997426160 829207300 117529428 88.09%
qnap nfs 1 16194897920 5106433024 11088464896 32.03%
# pvecm nodes
Membership information
----------------------
Nodeid Votes Name
3 1 prox7
4 1 prox8
5 1 prox9
6 1 prox10
1 1 prox1 (local)
2 1 prox2
7 1 prox3
8 1 prox4
9 1 prox5
10 1 prox6
# /etc/hosts only shows entries for the local node and 1 other node
# authorized_keys shows valid keys for all nodes
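For completeness, those two checks were just (paths as they appear on the node):
# cat /etc/hosts
# cat /root/.ssh/authorized_keys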
# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-2 (running version: 4.4-2/80259e05)
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.4.21-1-pve: 4.4.21-71
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-84
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-89
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
# pveversion -v from one of the other (upgraded) nodes:
root@prox3:~/.ssh# pveversion -v
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80