[SOLVED] reboots hang with "watchdog did not stop"

RobFantini

Renowned Member
Hello

Since at least mid-December 2016, 3 of 4 nodes take a long time to reboot. When I am at the machine, I do a manual reset.

At the system console there is this (may not be exact):

Code:
watchdog  watchdog0: watchdog did not stop!
As far as I know, the system will eventually restart after what seems like a long time.

Version info:
Code:
# pveversion --verbose
proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 10.2.5-1~bpo80+1
 

mir

Renowned Member
RobFantini said:
> watchdog  watchdog0: watchdog did not stop!
> proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
I see you are not running the latest kernel. Maybe a kernel upgrade will resolve the issue.
 

fireon

Famous Member
mir said:
> I see you are not running latest kernel. Maybe a kernel upgrade will resolve the issue.
It did not. It really depends on the hardware. I have never had this on new Dell servers, but on every new Supermicro with ZFS, and sometimes on an HP ML350.
 

Kevo

Member
I'm also having this issue on my new v5 installs. Is there any solution/workaround for this yet? It really adds a significant delay to reboots.
 

Kevo

Member
I've modified the systemd config file so the watchdog timeout is 10 seconds. That seems to limit the delay to about 10 seconds, so I'm going to use that as a workaround for now.
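For reference, the post above doesn't say which setting was changed; my assumption is the shutdown watchdog in /etc/systemd/system.conf, whose 10-minute default would match the long hang people are seeing. A sketch of that change:

```ini
# /etc/systemd/system.conf  (assumed location of the change, not confirmed by the poster)
[Manager]
# Default is 10min; this caps how long the hardware watchdog stays armed
# during reboot before it force-resets a hung machine.
ShutdownWatchdogSec=10s
```

Run `systemctl daemon-reexec` (or reboot) for the change to take effect.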
 

fireon

Famous Member
Kevo said:
> I've modified the systemd config file so the watchdog timeout is 10 seconds. That seems to limit the delay to about 10 seconds, so I'm going to use that as a work around for now.
Could that be a problem when I have a cluster?
 

rwadi

New Member
We are also seeing this on new v5 installs. Updates are coming from the enterprise repo and we are fully updated.

Rebooting a node sits on "watchdog watchdog0: watchdog did not stop!" for several minutes before the host reboots.
 
I have the issue with a new v5 install, enterprise repo, all updates installed (kernel 4.10.17-20) on an Intel NUC6i7KYK.
The system sits and waits after 'systemd-shutdown[1]: Sending SIGTERM to remaining processes...' for 90 s until it continues with 'systemd-shutdown[1]: Sending SIGKILL to remaining processes...', which in turn kills a running dmeventd.
So maybe dmeventd does not die on SIGTERM and is therefore hard-killed after 90 s?

EDIT: dmeventd not stopping on SIGTERM seems to be related to (for example) using LVM thin pools:
https://www.redhat.com/archives/dm-devel/2016-August/msg00075.html
https://www.redhat.com/archives/dm-devel/2016-August/msg00302.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00034.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00036.html
https://www.redhat.com/archives/dm-devel/2016-September/msg00041.html
https://bugs.archlinux.org/task/50420
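The 90 s gap between SIGTERM and SIGKILL matches systemd's DefaultTimeoutStopSec, so as a stopgap (an assumption on my part; it does not fix dmeventd itself) that global timeout could be shortened:

```ini
# /etc/systemd/system.conf
[Manager]
# Default is 90s; shortening it means stuck processes like dmeventd are
# SIGKILLed sooner during shutdown. Side effect: services that legitimately
# need longer to stop cleanly will also be killed earlier.
DefaultTimeoutStopSec=15s
```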
 

RobFantini

Renowned Member
There are some timeout values that might help if lowered, in:
Code:
/etc/systemd/system/multi-user.target.wants/pve-ha-crm.service
#and
/etc/systemd/system/multi-user.target.wants/pve-ha-lrm.service
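Note that those paths are symlinks to the packaged unit files, so direct edits can be lost on upgrade. Anyone experimenting with these timeouts should use a drop-in instead (the file name and the 60 s value here are only an illustration, not a recommended setting):

```ini
# /etc/systemd/system/pve-ha-lrm.service.d/override.conf  (hypothetical drop-in)
[Service]
TimeoutStopSec=60s
```

Run `systemctl daemon-reload` after creating it.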
 

fabian

Proxmox Staff Member
RobFantini said:
> there are some timeout values that might help by lowering at /etc/systemd/system/multi-user.target.wants/pve-ha-crm.service and /etc/systemd/system/multi-user.target.wants/pve-ha-lrm.service
Those are different timeouts/watchdogs, and lowering them will just lead to more forcefully killed HA guests.
 

fabian

Proxmox Staff Member
> @fabian what do you think about what I wrote regarding dmeventd?
Please post the output of:
  • pveversion -v
  • grep -v "^\(\s*#\|\s*$\)" /etc/lvm/lvm.conf
  • the full journalctl output covering boot and shutdown of a run where shutdown hung
 

sirsean12

New Member
Is there still no fix for this? I am currently working with three nodes connected to iSCSI for VM storage. I have been testing Proxmox for about three months and have a presentation at my job this coming Friday (November 10th), in hopes that I can convince the team to switch most of our clients over to Proxmox from VMware. We would be buying subscriptions, and after two months of nothing, this does not reflect well on your support. Below is the info you wanted us to provide.

Ceph is installed because I was messing around with it, but I am NOT using it.

Thank you!

pveversion -v

Code:
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-25
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90
ceph: 12.2.1-pve3


grep -v "^\(\s*#\|\s*$\)" /etc/lvm/lvm.conf

Code:
config {
checks = 1
abort_on_errors = 0
profile_dir = "/etc/lvm/profile"
}
devices {
dir = "/dev"
scan = [ "/dev" ]
obtain_device_list_from_udev = 1
external_device_info_source = "none"
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|" ]
cache_dir = "/run/lvm"
cache_file_prefix = ""
write_cache_state = 1
sysfs_scan = 1
multipath_component_detection = 1
md_component_detection = 1
fw_raid_component_detection = 0
md_chunk_alignment = 1
data_alignment_detection = 1
data_alignment = 0
data_alignment_offset_detection = 1
ignore_suspended_devices = 0
ignore_lvm_mirrors = 1
disable_after_error_count = 0
require_restorefile_with_uuid = 1
pv_min_size = 2048
issue_discards = 1
allow_changes_with_duplicate_pvs = 0
}
allocation {
maximise_cling = 1
use_blkid_wiping = 1
wipe_signatures_when_zeroing_new_lvs = 1
mirror_logs_require_separate_pvs = 0
cache_pool_metadata_require_separate_pvs = 0
thin_pool_metadata_require_separate_pvs = 0
}
log {
verbose = 0
silent = 0
syslog = 1
overwrite = 0
level = 0
indent = 1
command_names = 0
prefix = " "
activation = 0
debug_classes = [ "memory", "devices", "activation", "allocation", "lvmetad", "metadata", "cache", "locking", "lvmpolld", "dbus" ]
}
backup {
backup = 1
backup_dir = "/etc/lvm/backup"
archive = 1
archive_dir = "/etc/lvm/archive"
retain_min = 10
retain_days = 30
}
shell {
history_size = 100
}
global {
umask = 077
test = 0
units = "h"
si_unit_consistency = 1
suffix = 1
activation = 1
proc = "/proc"
etc = "/etc"
locking_type = 1
wait_for_locks = 1
fallback_to_clustered_locking = 1
fallback_to_local_locking = 1
locking_dir = "/run/lock/lvm"
prioritise_write_locks = 1
abort_on_internal_errors = 0
detect_internal_vg_cache_corruption = 0
metadata_read_only = 0
mirror_segtype_default = "raid1"
raid10_segtype_default = "raid10"
sparse_segtype_default = "thin"
use_lvmetad = 0
use_lvmlockd = 0
system_id_source = "none"
use_lvmpolld = 1
notify_dbus = 1
}
activation {
checks = 0
udev_sync = 1
udev_rules = 1
verify_udev_operations = 0
retry_deactivation = 1
missing_stripe_filler = "error"
use_linear_target = 1
reserved_stack = 64
reserved_memory = 8192
process_priority = -18
raid_region_size = 512
readahead = "auto"
raid_fault_policy = "warn"
mirror_image_fault_policy = "remove"
mirror_log_fault_policy = "allocate"
snapshot_autoextend_threshold = 100
snapshot_autoextend_percent = 20
thin_pool_autoextend_threshold = 100
thin_pool_autoextend_percent = 20
use_mlockall = 0
monitoring = 1
polling_interval = 15
activation_mode = "degraded"
}
dmeventd {
mirror_library = "libdevmapper-event-lvm2mirror.so"
snapshot_library = "libdevmapper-event-lvm2snapshot.so"
thin_library = "libdevmapper-event-lvm2thin.so"
}



I cannot find any journalctl output for anything older than my last startup.
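That is expected on a stock install: the journal is kept in memory and discarded on reboot. To capture the shutdown of a hanging run, make it persistent first (standard journald configuration, not Proxmox-specific):

```ini
# /etc/systemd/journald.conf
[Journal]
# Default on Debian is "auto", which stays volatile unless /var/log/journal exists.
Storage=persistent
```

After `systemctl restart systemd-journald` and one more reboot, `journalctl --list-boots` shows the earlier boots and `journalctl -b -1` prints the previous one, including its shutdown messages.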
 

RobFantini

Renowned Member
Update: this also happens on two of my laptops, both ThinkPads. The delay is not as long.

I think this is a hardware / BIOS / kernel interaction thing. We have Supermicro hardware and had a Dell with the issue. What kind of hardware are you using?

To deal with this: at this point, when rebooting a node we use ssh and IPMI to reset the hardware as soon as the message comes up. I like watching a node reboot, so this is not a pain; just a couple of clicks.

I think the watchdog thing is an unintended consequence of 10,000 other improvements, and I am not complaining; there is no such thing as perfect software. Where does a bug report on this need to go? Probably here, as there are more users of high-availability GNU/Debian/Linux software and hardware here than anywhere else I know of.

So this thread should move to a bug report. Can you file a bug? Just link to the thread.
 

sirsean12

New Member
RobFantini said:
> i think this is a hardware / bios / kernel interaction thing. [...] so this thread should move to a bug report. can you file a bug? just link to the thread.

Thanks for the input, Rob! As it happens, this is on a Supermicro server. Strangely, it seems to have stopped since I wrote my comment. Perhaps this happens when Proxmox is not aware that the nodes received a stop command and it tries to spin up HA?

I have no idea how to file a bug report.
 
