node would not restart, now has a /etc/pve issue

RobFantini

Hello

When I tried to stop a node, it got stuck here (from ps afx):
Code:
   6061 ?        SL     0:00  \_ startpar -p 4 -t 20 -T 3 -M stop -P 2 -R 6
   6534 ?        S      0:00      \_ /bin/sh /etc/init.d/vz stop
   6583 ?        R      0:48          \_ /sbin/modprobe -r ip_nat_ftp

There were no vz's on this node.

After 5 minutes I used kill -9 6534 to unstick it.
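
For anyone wanting to spot the same hang quickly, here is a minimal sketch that picks `modprobe -r` lines out of `ps afx` output and prints the PID, accumulated CPU time and module name. The function name `find_stuck_modprobe` is made up for illustration, and it assumes the `ps afx` column layout shown in the listing above.

```shell
# Hypothetical helper: read `ps afx` output on stdin and print
# "PID TIME MODULE" for every `modprobe -r <module>` process found.
# Assumes the columns are PID, TTY, STAT, TIME, then the command tree.
find_stuck_modprobe() {
    awk '{
        for (i = 5; i < NF; i++) {
            if ($i ~ /modprobe$/ && $(i + 1) == "-r") {
                print $1, $4, $(i + 2)
                break
            }
        }
    }'
}

# Usage (not run here):
#   ps afx | find_stuck_modprobe
```

A TIME value that keeps growing between two runs, together with state R, suggests the process is spinning rather than sleeping.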

While the node was down I made a change to /etc/pve/cluster.conf. That should not be an issue when a node is off, should it? We have 4 nodes in this cluster.

Yet here is the output of fbc3 /etc/pve # ls -l /etc/pve/cluster*

bad node:
Code:
-r--r----- 1 root www-data 1393 Aug 24 11:57 /etc/pve/cluster.conf
-r--r----- 1 root www-data 1393 Aug 27 12:33 /etc/pve/cluster.conf.new


good nodes have this:
Code:
# node  fbc87
-rw-r----- 1 root www-data 1505 Aug 27 12:44 cluster.conf
-rw-r----- 1 root www-data 1432 Aug 23 07:56 cluster.conf.old

# node fbc241
s012  ~ # ls -l /etc/pve/clus*
-rw-r----- 1 root www-data 1505 Aug 27 12:44 /etc/pve/cluster.conf
-rw-r----- 1 root www-data 1432 Aug 23 07:56 /etc/pve/cluster.conf.old


more info:
Code:
fbc87  /etc/pve # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X   1112                        fbc3
   2   M    828   2013-08-03 14:00:13  fbc87
   3   M   1072   2013-08-26 15:35:21  s035
   5   M    972   2013-08-22 18:47:55  s012


Bad node pveversion at time of reboot:
Code:
fbc3  /var/lib/vz/private # pveversion -v
proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-7
qemu-server: 3.1-1
pve-firmware: 1.0-23
libpve-common-perl: 3.0-6
libpve-access-control: 3.0-6
libpve-storage-perl: 3.0-10
pve-libspice-server1: 0.12.4-1
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.0-2

I upgraded after reboot, restarted and have the same issue.


Any suggestions to fix?
 
I moved the network connection to a dumb switch and after two more reboots the node came back up.

Originally it was connected to a Netgear layer 3 switch.
 
Just tested a shutdown, which still hangs here:
Code:
  14406 ?        SL     0:00  \_ startpar -p 4 -t 20 -T 3 -M stop -P 2 -R 0
  14820 ?        S      0:00      \_ /bin/sh /etc/init.d/vz stop
  14869 ?        R      0:12          \_ /sbin/modprobe -r ip_nat_ftp
 
What is the output of

# dpkg -l fuse-utils

Please remove that package if it is still installed:

# apt-get remove fuse-utils
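
If the check needs to be scripted, a small sketch along these lines may help. It relies on `dpkg-query -W -f='${Status}'`, which prints `install ok installed` for a fully installed package; the function name `pkg_installed` is only an illustration.

```shell
# Hypothetical helper: succeed only if package $1 is fully installed.
# A removed package with leftover config files reports a different
# status (for example "deinstall ok config-files") and counts as absent.
pkg_installed() {
    status=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$status" = "install ok installed" ]
}

# Usage (not run here):
#   if pkg_installed fuse-utils; then apt-get remove fuse-utils; fi
```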
 
fuse-utils was installed; I removed it.

The issue is still occurring. I rebooted two times and ps afx still shows:
Code:
 7946 ?        Ss     0:00 /bin/sh /etc/init.d/rc 6
   7952 ?        SL     0:00  \_ startpar -p 4 -t 20 -T 3 -M stop -P 2 -R 6
   8339 ?        S      0:00      \_ /bin/sh /etc/init.d/vz stop
   8419 ?        R      2:50          \_ /sbin/modprobe -r ip_nat_ftp

Note: this system was a pve + desktop system; now it is just a testing node.
Here is most of the ps afx output, apart from the kthreadd stuff:
Code:
   4257 ?        S      0:00  \_ [nfsio]
    586 ?        Ss     0:00 udevd --daemon
   3731 ?        S      0:00  \_ udevd --daemon
   4226 ?        S      0:00  \_ udevd --daemon
   2401 ?        Ss     0:00 /sbin/rpcbind -w
   2419 ?        Ss     0:00 /sbin/rpc.statd
   2442 ?        Ss     0:00 /usr/sbin/rpc.idmapd
   2642 ?        Ss     0:00 /usr/sbin/iscsid
   2643 ?        S<Ls   0:00 /usr/sbin/iscsid
   2785 ?        Sl     0:00 /usr/sbin/rsyslogd -c5
   2870 ?        Ss     0:00 /usr/sbin/vzeventd
   3058 ?        Ss     0:00 /usr/sbin/acpid
   3092 ?        S      0:00 /usr/sbin/dnsmasq -x /var/run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dp
   3115 ?        Ss     0:00 /usr/bin/dbus-daemon --system
   3118 ?        Sl     0:00 /usr/sbin/console-kit-daemon --no-daemon
   3188 ?        Sl     0:00 /usr/lib/policykit-1/polkitd --no-debug
   3232 ?        Ss     0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 101:104
   3438 ?        Ss     0:00 /usr/sbin/sshd
   3459 ?        Ss     0:00  \_ sshd: root@pts/0 
   3467 pts/0    Ss     0:00      \_ -bash
   8485 pts/0    R+     0:00          \_ ps afx
   3453 ?        Ssl    0:00 /usr/bin/pmxcfs
   3680 ?        Ss     0:00 /usr/sbin/cron
   3745 ?        S      0:00  \_ /USR/SBIN/CRON
   3754 ?        Ss     0:00      \_ /bin/sh -c /fbc/bin/linux-server-reboot-cronjob
   3756 ?        S      0:00          \_ /bin/sh /fbc/bin/linux-server-reboot-cronjob
   8469 ?        S      0:00              \_ sleep 240
   3759 ?        Ss     0:00 /usr/sbin/cupsd -C /etc/cups/cupsd.conf
   3814 ?        Sl     0:00 /usr/lib/x86_64-linux-gnu/colord/colord
   3871 ?        S<Lsl   0:01 corosync -f
   3960 ?        Ssl    0:00 fenced
   3985 ?        Ssl    0:00 dlm_controld
   7917 ?        Sl     0:00 /usr/lib/packagekit/packagekitd
   7946 ?        Ss     0:00 /bin/sh /etc/init.d/rc 6
   7952 ?        SL     0:00  \_ startpar -p 4 -t 20 -T 3 -M stop -P 2 -R 6
   8339 ?        S      0:00      \_ /bin/sh /etc/init.d/vz stop
   8419 ?        R      3:56          \_ /sbin/modprobe -r ip_nat_ftp

This is a system that can be re-installed, or we can take the time to try to solve this. No hurry but if someone has a suggestion please respond.