Need help troubleshooting Proxmox failing to restart - syslog included

Jay L

Active Member
Jan 27, 2018
19
3
43
54
Hi,

I have an issue with Proxmox not coming up after a reboot.

Code:
uname -a
Linux proxmox1 4.13.16-2-pve #1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12 +0200) x86_64 GNU/Linux

Update: adding PVE version:

Code:
root@proxmox1:~# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

I am running PM with LXC containers and experience the following:
  1. The core Linux system comes up as I can SSH into the Proxmox host using the root login
  2. None of the containers come up even though they are configured to auto start. I see the following
    1. "pct list" and "pct status <vmid>" time out
    2. "pct start" fails too
    3. The web gui is inaccesible
    4. I even tried "reboot" and the system hung without rebooting. ("reboot -f" worked)
This is a recurring problem and the system eventually comes up after multiple reboots.

I have never seen this issue in previous Proxmox versions and am wondering how I troubleshoot it. (My concern is that this could be a hardware issue. Could it be a bug?) I have full root access to the server and so can access any logs that are required.

Thank you in advance.
 
Last edited:
Hi,

So I have been reviewing the logs and find something very interesting. This morning around 8:30 I changed the network settings in the PM host and then issued the "reboot" command. However, wtmp does not show a reboot until 8:40. As I recall, I did issue a "shutdown -r" around 8:40.

I think that the problem could have been that the the system did not fully shutdown initially and potentially got stuck somehow. It is not clear if my "shutdown -r" helped at all or if it just happened to coincide when the system successfully rebooted.

The relevant syslog from this time is below. A couple of things jump out at me:
  1. I triggered the shutdown at 8:30:14, and it does not appear to complete until 8:39:16.
  2. There is also a weird gap from 8:31:01 and 8:37:46.
  3. SSH was still running on the Proxmox through 8:38:13 because I was able to SSH in at that time
Is there something obviously problematic here?

Code:
Apr 25 08:30:14 proxmox1 pvedaemon[1564]: <root@pam> successful auth for user 'root@pam'
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target Sound Card.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping User Manager for UID 0...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target Graphical Interface.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target Multi-User System.
Apr 25 08:30:49 proxmox1 systemd[10648]: Stopped target Default.
Apr 25 08:30:49 proxmox1 systemd[10648]: Stopped target Basic System.
Apr 25 08:30:49 proxmox1 systemd[10648]: Stopped target Sockets.
Apr 25 08:30:49 proxmox1 systemd[10648]: Closed GnuPG cryptographic agent (access for web browsers).
Apr 25 08:30:49 proxmox1 systemd[10648]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Apr 25 08:30:49 proxmox1 smartd[850]: smartd received signal 15: Terminated
Apr 25 08:30:49 proxmox1 systemd[10648]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Apr 25 08:30:49 proxmox1 smartd[850]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.LITEONIT_LMH_256V2M_41_MSATA_256GB_SED-TW035DNT5508565KH0UZ.ata.state
Apr 25 08:30:49 proxmox1 systemd[10648]: Closed GnuPG network certificate management daemon.
Apr 25 08:30:49 proxmox1 smartd[850]: smartd is exiting (exit status 0)
Apr 25 08:30:49 proxmox1 systemd[10648]: Closed GnuPG cryptographic agent and passphrase cache.
Apr 25 08:30:49 proxmox1 systemd[10648]: Reached target Shutdown.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping LXC Container Monitoring Daemon...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target Login Prompts.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Regular background program processing daemon...
Apr 25 08:30:49 proxmox1 systemd[10648]: Starting Exit the Session...
Apr 25 08:30:49 proxmox1 systemd[10648]: Stopped target Paths.
Apr 25 08:30:49 proxmox1 systemd[10648]: Stopped target Timers.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Postfix Mail Transport Agent.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Postfix Mail Transport Agent (instance -)...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target ZFS startup target.
Apr 25 08:30:49 proxmox1 systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target RPC Port Mapper.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
Apr 25 08:30:49 proxmox1 systemd[1]: Unmounting RPC Pipe File System...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped ZFS file system shares.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target ZFS pool import target.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped target Timers.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Proxmox VE replication runner.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Daily PVE download activities.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Daily apt upgrade and clean activities.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Daily apt download activities.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Session 92 of user root.
Apr 25 08:30:49 proxmox1 postfix[12144]: Postfix is running with backwards-compatible default settings
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping PVE guests...
Apr 25 08:30:49 proxmox1 postfix[12144]: See http://www.postfix.org/COMPATIBILITY_README.html for details
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Self Monitoring and Reporting Technology (SMART) Daemon...
Apr 25 08:30:49 proxmox1 postfix[12144]: To disable backwards compatibility use "postconf compatibility_level=2" and "postfix reload"
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping D-Bus System Message Bus...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping LSB: Ceph RBD Mapping...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Getty on tty1...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Kernel Samepage Merging (KSM) Tuning Daemon.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped LXC Container Monitoring Daemon.
Apr 25 08:30:49 proxmox1 systemd[10648]: Received SIGRTMIN+24 from PID 12117 (kill).
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Getty on tty1.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Regular background program processing daemon.
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Connection reset by peer
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Unmounted RPC Pipe File System.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped User Manager for UID 0.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Session 92 of user root.
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Failed to propagate agent release message: Transport endpoint is not connected
Apr 25 08:30:49 proxmox1 systemd[1]: Removed slice User Slice of root.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Login Service...
Apr 25 08:30:49 proxmox1 systemd[1]: Stopping Permit User Sessions...
Apr 25 08:30:49 proxmox1 systemd[1]: Removed slice system-getty.slice.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Permit User Sessions.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped D-Bus System Message Bus.
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped LSB: Ceph RBD Mapping.
Apr 25 08:30:49 proxmox1 postfix/postfix-script[12164]: stopping the Postfix mail system
Apr 25 08:30:49 proxmox1 postfix/master[1474]: terminating on signal 15
Apr 25 08:30:49 proxmox1 systemd[1]: Stopped Postfix Mail Transport Agent (instance -).
Apr 25 08:30:49 proxmox1 systemd[1]: Removed slice system-postfix.slice.
Apr 25 08:30:50 proxmox1 systemd[1]: Stopped Login Service.
Apr 25 08:30:53 proxmox1 pve-guests[12209]: <root@pam> starting task UPID:proxmox1:00002FB9:004301A9:5AE074FD:stopall::root@pam:
Apr 25 08:30:53 proxmox1 pvesh[12209]: Stopping CT 139 (timeout = 60 seconds)
Apr 25 08:30:53 proxmox1 pve-guests[12217]: <root@pam> starting task UPID:proxmox1:00002FBB:004301AD:5AE074FD:vzshutdown:139:root@pam:
Apr 25 08:30:53 proxmox1 pve-guests[12219]: shutdown CT 139: UPID:proxmox1:00002FBB:004301AD:5AE074FD:vzshutdown:139:root@pam:
Apr 25 08:30:53 proxmox1 pvesh[12209]: Stopping CT 138 (timeout = 60 seconds)
Apr 25 08:30:53 proxmox1 pve-guests[12217]: <root@pam> starting task UPID:proxmox1:00002FBD:004301B0:5AE074FD:vzshutdown:138:root@pam:
Apr 25 08:30:53 proxmox1 pve-guests[12221]: shutdown CT 138: UPID:proxmox1:00002FBD:004301B0:5AE074FD:vzshutdown:138:root@pam:
Apr 25 08:30:53 proxmox1 pvesh[12209]: Stopping CT 137 (timeout = 60 seconds)
Apr 25 08:30:53 proxmox1 pve-guests[12217]: <root@pam> starting task UPID:proxmox1:00002FC0:004301B2:5AE074FD:vzshutdown:137:root@pam:
Apr 25 08:30:53 proxmox1 pve-guests[12224]: shutdown CT 137: UPID:proxmox1:00002FC0:004301B2:5AE074FD:vzshutdown:137:root@pam:
Apr 25 08:30:53 proxmox1 pvesh[12209]: Stopping CT 135 (timeout = 60 seconds)
Apr 25 08:30:53 proxmox1 pve-guests[12217]: <root@pam> starting task UPID:proxmox1:00002FD0:004301B8:5AE074FD:vzshutdown:135:root@pam:
Apr 25 08:30:53 proxmox1 pve-guests[12240]: shutdown CT 135: UPID:proxmox1:00002FD0:004301B8:5AE074FD:vzshutdown:135:root@pam:
Apr 25 08:30:53 proxmox1 kernel: [43914.803598] vmbr0: port 4(veth138i0) entered disabled state
Apr 25 08:30:53 proxmox1 kernel: [43914.829356] device veth138i0 left promiscuous mode
Apr 25 08:30:53 proxmox1 kernel: [43914.829367] vmbr0: port 4(veth138i0) entered disabled state
Apr 25 08:30:53 proxmox1 kernel: [43914.862761] vmbr0: port 5(veth139i0) entered disabled state
Apr 25 08:30:53 proxmox1 kernel: [43914.870542] device veth139i0 left promiscuous mode
Apr 25 08:30:53 proxmox1 kernel: [43914.870556] vmbr0: port 5(veth139i0) entered disabled state
Apr 25 08:30:57 proxmox1 kernel: [43918.223566] vmbr0: port 2(veth135i0) entered disabled state
Apr 25 08:30:57 proxmox1 kernel: [43918.231937] device veth135i0 left promiscuous mode
Apr 25 08:30:57 proxmox1 kernel: [43918.231949] vmbr0: port 2(veth135i0) entered disabled state
Apr 25 08:31:00 proxmox1 pve-guests[12217]: end task UPID:proxmox1:00002FBD:004301B0:5AE074FD:vzshutdown:138:root@pam:
Apr 25 08:31:00 proxmox1 pve-guests[12217]: end task UPID:proxmox1:00002FBB:004301AD:5AE074FD:vzshutdown:139:root@pam:
Apr 25 08:31:01 proxmox1 pve-guests[12217]: end task UPID:proxmox1:00002FD0:004301B8:5AE074FD:vzshutdown:135:root@pam:
Apr 25 08:37:46 proxmox1 pct[12603]: <root@pam> starting task UPID:proxmox1:00003141:0043A30E:5AE0769A:vzstart:135:root@pam:
Apr 25 08:37:46 proxmox1 pct[12609]: starting CT 135: UPID:proxmox1:00003141:0043A30E:5AE0769A:vzstart:135:root@pam:
Apr 25 08:37:46 proxmox1 systemd[1]: Starting PVE LXC Container: 135...
Apr 25 08:37:48 proxmox1 kernel: [44329.307985] EXT4-fs (dm-10): mounted filesystem with ordered data mode. Opts: (null)
Apr 25 08:37:48 proxmox1 kernel: [44329.383835] IPv6: ADDRCONF(NETDEV_UP): veth135i0: link is not ready
Apr 25 08:37:48 proxmox1 systemd-udevd[12623]: Could not generate persistent MAC address for vethOEJSI7: No such file or directory
Apr 25 08:37:49 proxmox1 kernel: [44330.802632] vmbr0: port 2(veth135i0) entered blocking state
Apr 25 08:37:49 proxmox1 kernel: [44330.802637] vmbr0: port 2(veth135i0) entered disabled state
Apr 25 08:37:49 proxmox1 kernel: [44330.802795] device veth135i0 entered promiscuous mode
Apr 25 08:38:13 proxmox1 pct[12609]: command 'systemctl start pve-container@135' failed: received interrupt
Apr 25 08:38:13 proxmox1 pct[12603]: <root@pam> end task UPID:proxmox1:00003141:0043A30E:5AE0769A:vzstart:135:root@pam: command 'systemctl start pve-container@135' failed: received interrupt
Apr 25 08:39:16 proxmox1 systemd[1]: pve-container@135.service: Start operation timed out. Terminating.
Apr 25 08:39:16 proxmox1 systemd[1]: Failed to start PVE LXC Container: 135.
Apr 25 08:39:16 proxmox1 systemd[1]: pve-container@135.service: Unit entered failed state.
Apr 25 08:39:16 proxmox1 systemd[1]: pve-container@135.service: Failed with result 'timeout'.

****Boot Starts Below****

Apr 25 08:42:03 proxmox1 systemd-modules-load[279]: Inserted module 'iscsi_tcp'
Apr 25 08:42:03 proxmox1 systemd-modules-load[279]: Inserted module 'ib_iser'
Apr 25 08:42:03 proxmox1 systemd-modules-load[279]: Inserted module 'vhost_net'
Apr 25 08:42:03 proxmox1 systemd[1]: Starting Flush Journal to Persistent Storage...
Apr 25 08:42:03 proxmox1 systemd[1]: Started udev Kernel Device Manager.
Apr 25 08:42:03 proxmox1 systemd[1]: Started Flush Journal to Persistent Storage.
Apr 25 08:42:03 proxmox1 systemd[1]: Started udev Coldplug all Devices.
Apr 25 08:42:03 proxmox1 systemd[1]: Starting udev Wait for Complete Device Initialization...
Apr 25 08:42:03 proxmox1 systemd[1]: Started Set the console keyboard layout.
Apr 25 08:42:03 proxmox1 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Apr 25 08:42:03 proxmox1 systemd[1]: Starting Load/Save RF Kill Switch Status...
Apr 25 08:42:03 proxmox1 systemd[1]: Found device LITEONIT_LMH-256V2M-41_MSATA_256GB_SED 2.
Apr 25 08:42:03 proxmox1 systemd[1]: Found device /dev/pve/swap.
Apr 25 08:42:03 proxmox1 systemd[1]: Activating swap /dev/pve/swap...
Apr 25 08:42:03 proxmox1 systemd[1]: Started Load/Save RF Kill Switch Status.
Apr 25 08:42:03 proxmox1 systemd[1]: Activated swap /dev/pve/swap.
Apr 25 08:42:03 proxmox1 systemd[1]: Reached target Swap.
Apr 25 08:42:03 proxmox1 systemd-modules-load[279]: Inserted module 'zfs'
and so on...
 
Hi,

what do you see on the console of the proxmox host when you do the reboot? Are you able / have you tried to 'tweak' the 'default' boot stanza being passed to the kernel? (ie, don't let it auto-boot with default!)

My test-dev proxmox box was giving me some pain recently after a reboot, with very little visible on console (ie, nothing ..) . I had to reboot, intervene and edit my boot option stanza,

-- removed the 'quiet' so that I could better see what was happening
-- added the 'fastboot' option to discourage it from doing a FSCK on a disk that it thought was a good idea to force a check on.

in my case this helped - I had a 2Tb non-raid 'junk misc storage drive' attached and mounted and .. default filesystem options decided after my nice long uptime, a forced checkdisk was a good thing. Which was hard to see - when it was 'quiet' booting.

clearly - you may be in a different circumstance entirely. But sometimes by default console can be encouraged to give you more info / if you haven't done this already ?

Tim
 
Tim,

Thank you. I have not checked the console. I will perform some more exploration and capture it to see if I can get more details.

Thank you again.