Node froze shortly after update to kernel 4.4.8-51

Hi,
one of the freshly updated nodes froze today.
Nothing in the logfiles.
The console froze too - no output, no kernel panic message.
No events in the IPMI log.
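FWIW, for anyone who wants to compare: the BMC event log can be read with ipmitool, e.g.:
Code:
# read the system event log on the node itself
ipmitool sel elist
# or remotely over the BMC LAN interface
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist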

The node was stable for months with pve3.4 and pve4.2.

Software changes:
Installed multipath-tools.
Accidentally installed ceph 10.2.1 - removed it and reinstalled 0.94.7 (see the pinning sketch below).
Updated to the latest pve versions.
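To make sure apt doesn't pull in ceph 10.2.x again by accident, the hammer packages can be pinned - a sketch only (the package list and version pattern are examples, adjust them to your repo):
Code:
# /etc/apt/preferences.d/ceph-hammer
Package: ceph ceph-common librados2 librbd1
Pin: version 0.94.*
Pin-Priority: 1001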

Code:
pveversion -v
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-51
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
ceph: 0.94.7-1jessie
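If the new kernel is the culprit, booting the previous 4.4.6-1-pve again should show it. A sketch (the "1>2" submenu index is only an example - check the entry order in /boot/grub/grub.cfg first):
Code:
# list the installed kernels
dpkg -l 'pve-kernel-*'
# default-boot an older entry from the grub "Advanced options" submenu
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="1>2"/' /etc/default/grub
update-grub
reboot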
Udo
 
Hi,
unfortunately the same happened on the next node yesterday. Same hardware (mostly), but no multipath installed.
Code:
root@prox-03:~# pveversion -v
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-51
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
ceph: 0.94.7-1jessie
CPU is an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz.
Enough free RAM, no ZFS.
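Since the console shows nothing when the box dies, netconsole to another host might still catch a panic message - a sketch (IPs, MAC and NIC name are examples only):
Code:
# on the crashing node: send kernel messages via eth0 to 172.20.2.99:6666
modprobe netconsole netconsole=6665@172.20.2.61/eth0,6666@172.20.2.99/aa:bb:cc:dd:ee:ff
# raise the console loglevel so everything gets forwarded
dmesg -n 8
# on the receiving host (netcat-traditional syntax):
nc -u -l -p 6666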

I got a totem error approx. 17 minutes before the node died, and FC errors twice (after I moved a VM disk online):
Code:
Jun  9 09:06:57 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:06:57 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:08:40 prox-03 corosync[2408]:  [TOTEM ] A processor failed, forming new configuration.
Jun  9 09:08:41 prox-03 corosync[2408]:  [TOTEM ] A new membership (172.20.2.61:252) was formed. Members
Jun  9 09:08:41 prox-03 corosync[2408]:  [QUORUM] Members[3]: 1 2 3
Jun  9 09:08:41 prox-03 corosync[2408]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun  9 09:10:58 prox-03 pvedaemon[31839]: <udo@ldap> move disk VM 303: move --disk virtio0 --storage fc1_sas_lun2
Jun  9 09:10:58 prox-03 pvedaemon[31839]: <udo@ldap> starting task UPID:prox-03:00001BA9:018EEA74:57591682:qmmove:303:udo@ldap:
Jun  9 09:13:14 prox-03 pvedaemon[10961]: worker exit
Jun  9 09:13:14 prox-03 pvedaemon[2475]: worker 10961 finished
Jun  9 09:13:14 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:13:14 prox-03 pvedaemon[2475]: worker 8685 started
Jun  9 09:13:58 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:14:29 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:14:30 prox-03 kernel: [261618.638080] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.638420] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.638757] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639094] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639439] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639780] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640099] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640413] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640724] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.641037] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.644069] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  0 2002.
Jun  9 09:14:35 prox-03 pvestatd[2433]: status update time (34.462 seconds)
Jun  9 09:15:01 prox-03 CRON[9488]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun  9 09:17:01 prox-03 CRON[10659]: (root) CMD (  cd / && run-parts --report /etc/cron.hourly)
Jun  9 09:18:50 prox-03 kernel: [261878.579255] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.579755] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.580259] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.580743] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.581205] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.581687] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.582151] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.584948] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:54 prox-03 pvedaemon[31839]: worker exit
Jun  9 09:18:54 prox-03 pvedaemon[2475]: worker 31839 finished
Jun  9 09:18:54 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:18:54 prox-03 pvedaemon[2475]: worker 11506 started
Jun  9 09:18:54 prox-03 pvestatd[2433]: status update time (28.839 seconds)
Jun  9 09:21:27 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:22:28 prox-03 pvedaemon[18100]: worker exit
Jun  9 09:22:28 prox-03 pvedaemon[2475]: worker 18100 finished
Jun  9 09:22:28 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:22:28 prox-03 pvedaemon[2475]: worker 13851 started
######### here the node was reset ##########
Jun  9 09:29:55 prox-03 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="2455" x-info="http://www.rsyslog.com"] start
Jun  9 09:29:55 prox-03 systemd-modules-load[449]: Module 'fuse' is builtin
Jun  9 09:29:55 prox-03 systemd-modules-load[449]: Inserted module 'vhost_net'
Jun  9 09:29:55 prox-03 hdparm[488]: Setting parameters of disc: (none).
Jun  9 09:29:55 prox-03 keyboard-setup[489]: Setting preliminary keymap...done.
Jun  9 09:29:55 prox-03 systemd-fsck[736]: /dev/sdc1: recovering journal
Jun  9 09:29:55 prox-03 systemd-fsck[736]: /dev/sdc1: clean, 16/61038592 files, 33908830/244137472 blocks
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "bacula" now active
Jun  9 09:29:55 prox-03 systemd-fsck[947]: /dev/mapper/bacula-backup: recovering journal
Jun  9 09:29:55 prox-03 lvm[931]: 17 logical volume(s) in volume group "fc1_sas_lun2" now active
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "fc1-ssd-lun1" now active
Jun  9 09:29:55 prox-03 systemd-fsck[947]: /dev/mapper/bacula-backup: clean, 33021/395313152 files, 2018185557/3162505216 blocks
Jun  9 09:29:55 prox-03 lvm[931]: 6 logical volume(s) in volume group "fc1_data_lun0" now active
Jun  9 09:29:55 prox-03 lvm[931]: 4 logical volume(s) in volume group "local_ssd" now active
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "local_pve" now active
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpuset
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpu
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpuacct
Jun  9 09:29:55 prox-03 systemd-fsck[1259]: /dev/mapper/local_pve-data: recovering journal
...
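The qla2xxx "Abort command issued" lines usually mean SCSI commands timed out on the FC path - no idea yet whether that is cause or symptom of the freeze. To see what HBA and LUN report, something like this (sdf is only an example device):
Code:
# FC HBA link state and speed
cat /sys/class/fc_host/host*/port_state
cat /sys/class/fc_host/host*/speed
# per-LUN SCSI command timeout (default 30s), can be raised if the storage is slow under load
cat /sys/block/sdf/device/timeout
echo 60 > /sys/block/sdf/device/timeout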
Udo
 
Hi again,
I just looked in the logfile from the first crash... to my eyes there is nothing there (except a line break)
Code:
Jun  7 11:57:23 prox-02 rrdcached[2695]: flushing old values
Jun  7 11:57:23 prox-02 rrdcached[2695]: rotating journals
Jun  7 11:57:23 prox-02 rrdcached[2695]: started new journal /var/lib/rrdcached/journal/rrd.journal.1465293443.229715
Jun  7 11:57:23 prox-02 rrdcached[2695]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1465286243.229732
Jun  7 11:57:40 prox-02 puppet-agent[1885]: Finished catalog run in 1.11 seconds
Jun  7 11:58:40 prox-02 pveproxy[23848]: worker exit
Jun  7 11:58:40 prox-02 pveproxy[29147]: worker 23848 finished
Jun  7 11:58:40 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 11:58:40 prox-02 pveproxy[29147]: worker 3430 started
Jun  7 11:58:40 prox-02 pvedaemon[30198]: <root@pam> successful auth for user 'fxxxxx@ldap'
Jun  7 11:58:50 prox-02 pmxcfs[2886]: [status] notice: received log
Jun  7 11:59:56 prox-02 pveproxy[25187]: worker exit
Jun  7 11:59:56 prox-02 pveproxy[29147]: worker 25187 finished
Jun  7 11:59:56 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 11:59:56 prox-02 pveproxy[29147]: worker 4347 started
Jun  7 12:02:45 prox-02 pveproxy[28676]: worker exit
Jun  7 12:02:45 prox-02 pveproxy[29147]: worker 28676 finished
Jun  7 12:02:45 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 12:02:45 prox-02 pveproxy[29147]: worker 6779 started
Jun  7 12:04:22 prox-02 pmxcfs[2886]: [dcdb] notice: data verification successful
Jun  7 12:04:31 prox-02 pmxcfs[2886]: [status] notice: received log

Jun  7 12:13:53 prox-02 systemd-modules-load[415]: Module 'fuse' is builtin
Jun  7 12:13:53 prox-02 systemd-modules-load[415]: Inserted module 'vhost_net'
Jun  7 12:13:53 prox-02 hdparm[445]: Setting parameters of disc: (none).
Jun  7 12:13:53 prox-02 keyboard-setup[446]: Setting preliminary keymap...done.
Jun  7 12:13:53 prox-02 systemd-fsck[841]: /dev/sdd1: recovering journal
Jun  7 12:13:53 prox-02 systemd-fsck[843]: /dev/sde1: recovering journal
Jun  7 12:13:53 prox-02 systemd-fsck[843]: /dev/sde1: clean, 13/106815488 files, 6754296/427245568 blocks
Jun  7 12:13:53 prox-02 systemd-fsck[841]: /dev/sdd1: clean, 48/4890624 files, 7255376/19530752 blocks
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV wlfd02Ve6d4JgWcFCx0X7lVRrkohOsSg: using /dev/sdi not /dev/sdf
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV NRyr6FtByg0hyzSaKcew9GtPmF2dP05B: using /dev/sdj not /dev/sdg
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV O5Z27PPlc24IgK495LaaXGagiiRO3ynE: using /dev/sdk not /dev/sdh
Jun  7 12:13:53 prox-02 lvm[880]: 8 logical volume(s) in volume group "fc1_sas_lun2" now active
Jun  7 12:13:53 prox-02 lvm[880]: 1 logical volume(s) in volume group "fc1-ssd-lun1" now active
Jun  7 12:13:53 prox-02 lvm[880]: 6 logical volume(s) in volume group "fc1_data_lun0" now active
Jun  7 12:13:53 prox-02 lvm[880]: 2 logical volume(s) in volume group "a_sas_r0" now active
Jun  7 12:13:53 prox-02 lvm[880]: 0 logical volume(s) in volume group "b_sas_r1" now active
Jun  7 12:13:53 prox-02 lvm[880]: 6 logical volume(s) in volume group "pve_local" now active
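The "Found duplicate PV" messages are the usual symptom of LVM scanning the single FC paths in addition to the multipath maps. If those LUNs are under multipath control, a filter in /etc/lvm/lvm.conf normally fixes that - a sketch (the local-disk pattern sd[a-e] is only an example, adjust it):
Code:
# /etc/lvm/lvm.conf, in the devices { } section
global_filter = [ "a|^/dev/mapper/|", "a|^/dev/sd[a-e]|", "r|.*|" ]
# afterwards: update-initramfs -u, so the filter is also active in the initrd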
Udo
 
