[SOLVED] reboots hang with "watchdog did not stop"

RobFantini
Hello

Since at least mid-December 2016, 3 of 4 nodes take a long time to reboot. When I am at the machine, I do a manual reset.

At the system console there is this (may not be exact):

Code:
watchdog  watchdog0: watchdog did not stop!

As far as I know, the system will eventually restart after what seems like a long time.

Version info:
Code:
# pveversion --verbose
proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 10.2.5-1~bpo80+1
 
proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
I see you are not running the latest kernel. Maybe a kernel upgrade will resolve the issue.
 
I see you are not running the latest kernel. Maybe a kernel upgrade will resolve the issue.
It did not. It really depends on the hardware. We have never had this on new Dell servers, but we see it on every new Supermicro with ZFS, and sometimes on an HP ML350.
 
I'm also having this issue on my new v5 installs. Is there any solution/workaround for this yet? It really adds a significant delay to reboots.
 
I'm also having this issue on my new v5 installs. Is there any solution/workaround for this yet? It really adds a significant delay to reboots.
Same here sometimes. So... NOT FIXED!
 
I've modified the systemd config file so the watchdog timeout is 10 seconds. That seems to limit the delay to about 10 seconds, so I'm going to use that as a workaround for now.
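For reference, this is roughly the kind of change meant here; I am assuming the setting in question is the shutdown watchdog in /etc/systemd/system.conf (called ShutdownWatchdogSec on this systemd version, RebootWatchdogSec on newer ones), so treat it as a sketch rather than a recipe:
Code:
# /etc/systemd/system.conf  (assumed setting, 10s as described above)
[Manager]
ShutdownWatchdogSec=10s

# re-execute the manager so it picks up the change
systemctl daemon-reexec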
 
I've modified the systemd config file so the watchdog timeout is 10 seconds. That seems to limit the delay to about 10 seconds, so I'm going to use that as a workaround for now.
Maybe this is bad when I have a cluster?
 
We are all seeing this on new v5 installs. Updates are coming from the enterprise repo and we are fully updated.

Rebooting a node sits on "watchdog watchdog0: watchdog did not stop!" for several minutes before the host reboots.
 
I have the issue with a new v5 install, enterprise repo, all updates installed (kernel 4.10.17-20) on an Intel NUC6i7KYK.
The system sits and waits after 'systemd-shutdown[1]: Sending SIGTERM to remaining processes...' for 90s until it continues with 'systemd-shutdown[1]: Sending SIGKILL to remaining processes...' which in turn kills a running dmeventd.
So maybe dmeventd does not die from SIGTERM and therefore is being hard-killed after 90s?
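(For what it is worth, 90s matches systemd's default stop timeout; assuming nothing overrides it on this install, it can be checked with:)
Code:
systemctl show -p DefaultTimeoutStopSec
# DefaultTimeoutStopSec=1min 30s on a stock install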

EDIT: dmeventd not stopping on SIGTERM seems to be related to (for example) using LVM thin pools:
https://www.redhat.com/archives/dm-devel/2016-August/msg00075.html
https://www.redhat.com/archives/dm-devel/2016-August/msg00302.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00034.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00036.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00041.html
https://bugs.archlinux.org/task/50420
 
There are some timeout values that it might help to lower, in:
Code:
/etc/systemd/system/multi-user.target.wants/pve-ha-crm.service
#and
/etc/systemd/system/multi-user.target.wants/pve-ha-lrm.service
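Rather than editing the unit files under multi-user.target.wants directly, the usual way to change such a timeout would be a drop-in override; a minimal sketch, with 30s purely as an example value (and see the caveat in the reply below):
Code:
systemctl edit pve-ha-lrm.service
# in the editor that opens, add:
[Service]
TimeoutStopSec=30s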
 
There are some timeout values that it might help to lower, in:
Code:
/etc/systemd/system/multi-user.target.wants/pve-ha-crm.service
#and
/etc/systemd/system/multi-user.target.wants/pve-ha-lrm.service

Those are different timeouts/watchdogs, and lowering them will just lead to more forcefully killed HA guests.
 
@fabian what do you think about what I wrote regarding dmeventd?

please post the output of:
  • pveversion -v
  • grep -v "^\(\s*#\|\s*$\)" /etc/lvm/lvm.conf
  • the full journalctl output covering the boot and shutdown of a system boot where the shutdown was hanging
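Assuming persistent journaling is enabled on the node, the previous boot (the one whose shutdown hung) can usually be exported with:
Code:
journalctl -b -1 > previous-boot.log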
 
Is there still no fix for this? I am currently working with three nodes connected to iSCSI for VM storage. I have been testing Proxmox for about three months and have a presentation at my job this coming Friday (November 10th), in the hope that I can convince the team to switch most of our clients over to Proxmox from VMware. We would be buying subscriptions, and after two months of a nothing burger this does not reflect well on your support. Below is the info you wanted us to provide.

I am not using Ceph. It is installed, as I was messing around with it, but I am NOT using it.

Thank you!

pveversion -v


proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-25
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90
ceph: 12.2.1-pve3


grep -v "^\(\s*#\|\s*$\)" /etc/lvm/lvm.conf


config {
checks = 1
abort_on_errors = 0
profile_dir = "/etc/lvm/profile"
}
devices {
dir = "/dev"
scan = [ "/dev" ]
obtain_device_list_from_udev = 1
external_device_info_source = "none"
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|" ]
cache_dir = "/run/lvm"
cache_file_prefix = ""
write_cache_state = 1
sysfs_scan = 1
multipath_component_detection = 1
md_component_detection = 1
fw_raid_component_detection = 0
md_chunk_alignment = 1
data_alignment_detection = 1
data_alignment = 0
data_alignment_offset_detection = 1
ignore_suspended_devices = 0
ignore_lvm_mirrors = 1
disable_after_error_count = 0
require_restorefile_with_uuid = 1
pv_min_size = 2048
issue_discards = 1
allow_changes_with_duplicate_pvs = 0
}
allocation {
maximise_cling = 1
use_blkid_wiping = 1
wipe_signatures_when_zeroing_new_lvs = 1
mirror_logs_require_separate_pvs = 0
cache_pool_metadata_require_separate_pvs = 0
thin_pool_metadata_require_separate_pvs = 0
}
log {
verbose = 0
silent = 0
syslog = 1
overwrite = 0
level = 0
indent = 1
command_names = 0
prefix = " "
activation = 0
debug_classes = [ "memory", "devices", "activation", "allocation", "lvmetad", "metadata", "cache", "locking", "lvmpolld", "dbus" ]
}
backup {
backup = 1
backup_dir = "/etc/lvm/backup"
archive = 1
archive_dir = "/etc/lvm/archive"
retain_min = 10
retain_days = 30
}
shell {
history_size = 100
}
global {
umask = 077
test = 0
units = "h"
si_unit_consistency = 1
suffix = 1
activation = 1
proc = "/proc"
etc = "/etc"
locking_type = 1
wait_for_locks = 1
fallback_to_clustered_locking = 1
fallback_to_local_locking = 1
locking_dir = "/run/lock/lvm"
prioritise_write_locks = 1
abort_on_internal_errors = 0
detect_internal_vg_cache_corruption = 0
metadata_read_only = 0
mirror_segtype_default = "raid1"
raid10_segtype_default = "raid10"
sparse_segtype_default = "thin"
use_lvmetad = 0
use_lvmlockd = 0
system_id_source = "none"
use_lvmpolld = 1
notify_dbus = 1
}
activation {
checks = 0
udev_sync = 1
udev_rules = 1
verify_udev_operations = 0
retry_deactivation = 1
missing_stripe_filler = "error"
use_linear_target = 1
reserved_stack = 64
reserved_memory = 8192
process_priority = -18
raid_region_size = 512
readahead = "auto"
raid_fault_policy = "warn"
mirror_image_fault_policy = "remove"
mirror_log_fault_policy = "allocate"
snapshot_autoextend_threshold = 100
snapshot_autoextend_percent = 20
thin_pool_autoextend_threshold = 100
thin_pool_autoextend_percent = 20
use_mlockall = 0
monitoring = 1
polling_interval = 15
activation_mode = "degraded"
}
dmeventd {
mirror_library = "libdevmapper-event-lvm2mirror.so"
snapshot_library = "libdevmapper-event-lvm2snapshot.so"
thin_library = "libdevmapper-event-lvm2thin.so"
}



I cannot find any journalctl output for anything older than my last startup.
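(That usually means the journal is not persistent. Assuming the stock Storage=auto setting in journald.conf, creating the directory below is enough to keep logs across reboots:)
Code:
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
# after the next reboot, the previous boot's log is available with:
journalctl -b -1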
 
Update: this also happens on 2 of my laptops, both ThinkPads. The delay is not as long.

I think this is a hardware/BIOS/kernel interaction thing. We have Supermicro hardware and had a Dell with the issue. What kind of hardware are you using?

To deal with this: at this point, when rebooting a node we use SSH and IPMI to reset the hardware as soon as the message comes up. I like watching a node reboot, so this is not a pain; it is just a couple of clicks.
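In case it is useful, the reset we do looks roughly like this with ipmitool (address and credentials are placeholders):
Code:
# <bmc-address>, <user> and <password> are placeholders for your BMC details
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> power reset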

I think the watchdog thing is an unintended consequence of 10,000 other improvements, and I am not complaining; there is no such thing as perfect software. Where does a bug report on this need to go? Probably here, since there are more users of high-availability GNU/Debian/Linux software and hardware here than anywhere else I know of.

So this thread should move to a bug report. Can you file a bug? Just link to the thread.
 


Thanks for the input, Rob! As it so happens, this is on a Supermicro server. Strangely, it seems to have stopped since I wrote my comment. Perhaps this happens when Proxmox is not aware that the node received a stop command and tries to spin up HA?

I have no idea how to file a bug report.
 
