Node froze shortly after update to kernel 4.4.8-51

Hi,
one of the freshly updated nodes froze today.
Nothing in the logfiles.
The console froze too - no output, no kernel panic message.
No events in the IPMI log.
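FWIW, for anyone who wants to compare: the BMC event log can be read with ipmitool, e.g.:
Code:
# read the system event log on the node itself
ipmitool sel elist
# or remotely over the BMC LAN interface
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist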

The node was stable for months with pve3.4 and pve4.2.

Software changes:
Installed multipath-tools.
Accidentally installed ceph 10.2.1 - removed it and reinstalled 0.94.7 (see the pinning sketch below).
Updated to the latest pve versions.
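To make sure apt doesn't pull in ceph 10.2.x again by accident, the hammer packages can be pinned - a sketch only (the package list and version pattern are examples, adjust them to your repo):
Code:
# /etc/apt/preferences.d/ceph-hammer
Package: ceph ceph-common librados2 librbd1
Pin: version 0.94.*
Pin-Priority: 1001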

Code:
pveversion -v
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-51
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
ceph: 0.94.7-1jessie
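If the new kernel is the culprit, booting the previous 4.4.6-1-pve again should show it. A sketch (the "1>2" submenu index is only an example - check the entry order in /boot/grub/grub.cfg first):
Code:
# list the installed kernels
dpkg -l 'pve-kernel-*'
# default-boot an older entry from the grub "Advanced options" submenu
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="1>2"/' /etc/default/grub
update-grub
reboot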
Udo
 
Hi,
unfortunately the same happened on the next node yesterday. Same hardware (mostly), but no multipath installed.
Code:
root@prox-03:~# pveversion -v
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-51
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
ceph: 0.94.7-1jessie
CPU is an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz.
Enough free RAM, no ZFS.
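Since the console shows nothing when the box dies, netconsole to another host might still catch a panic message - a sketch (IPs, MAC and NIC name are examples only):
Code:
# on the crashing node: send kernel messages via eth0 to 172.20.2.99:6666
modprobe netconsole netconsole=6665@172.20.2.61/eth0,6666@172.20.2.99/aa:bb:cc:dd:ee:ff
# raise the console loglevel so everything gets forwarded
dmesg -n 8
# on the receiving host (netcat-traditional syntax):
nc -u -l -p 6666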

I got a totem error approx. 17 minutes before the node died, and FC errors twice (after I moved a VM disk online):
Code:
Jun  9 09:06:57 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:06:57 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:08:40 prox-03 corosync[2408]:  [TOTEM ] A processor failed, forming new configuration.
Jun  9 09:08:41 prox-03 corosync[2408]:  [TOTEM ] A new membership (172.20.2.61:252) was formed. Members
Jun  9 09:08:41 prox-03 corosync[2408]:  [QUORUM] Members[3]: 1 2 3
Jun  9 09:08:41 prox-03 corosync[2408]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun  9 09:10:58 prox-03 pvedaemon[31839]: <udo@ldap> move disk VM 303: move --disk virtio0 --storage fc1_sas_lun2
Jun  9 09:10:58 prox-03 pvedaemon[31839]: <udo@ldap> starting task UPID:prox-03:00001BA9:018EEA74:57591682:qmmove:303:udo@ldap:
Jun  9 09:13:14 prox-03 pvedaemon[10961]: worker exit
Jun  9 09:13:14 prox-03 pvedaemon[2475]: worker 10961 finished
Jun  9 09:13:14 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:13:14 prox-03 pvedaemon[2475]: worker 8685 started
Jun  9 09:13:58 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:14:29 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:14:30 prox-03 kernel: [261618.638080] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.638420] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.638757] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639094] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639439] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.639780] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640099] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640413] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.640724] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.641037] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:14:30 prox-03 kernel: [261618.644069] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  0 2002.
Jun  9 09:14:35 prox-03 pvestatd[2433]: status update time (34.462 seconds)
Jun  9 09:15:01 prox-03 CRON[9488]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun  9 09:17:01 prox-03 CRON[10659]: (root) CMD (  cd / && run-parts --report /etc/cron.hourly)
Jun  9 09:18:50 prox-03 kernel: [261878.579255] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.579755] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.580259] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.580743] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.581205] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.581687] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.582151] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:50 prox-03 kernel: [261878.584948] qla2xxx [0000:04:00.0]-801c:2: Abort command issued nexus=2:1:2 --  1 2002.
Jun  9 09:18:54 prox-03 pvedaemon[31839]: worker exit
Jun  9 09:18:54 prox-03 pvedaemon[2475]: worker 31839 finished
Jun  9 09:18:54 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:18:54 prox-03 pvedaemon[2475]: worker 11506 started
Jun  9 09:18:54 prox-03 pvestatd[2433]: status update time (28.839 seconds)
Jun  9 09:21:27 prox-03 pmxcfs[2309]: [status] notice: received log
Jun  9 09:22:28 prox-03 pvedaemon[18100]: worker exit
Jun  9 09:22:28 prox-03 pvedaemon[2475]: worker 18100 finished
Jun  9 09:22:28 prox-03 pvedaemon[2475]: starting 1 worker(s)
Jun  9 09:22:28 prox-03 pvedaemon[2475]: worker 13851 started
######### here the node was reset ##########
Jun  9 09:29:55 prox-03 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="2455" x-info="http://www.rsyslog.com"] start
Jun  9 09:29:55 prox-03 systemd-modules-load[449]: Module 'fuse' is builtin
Jun  9 09:29:55 prox-03 systemd-modules-load[449]: Inserted module 'vhost_net'
Jun  9 09:29:55 prox-03 hdparm[488]: Setting parameters of disc: (none).
Jun  9 09:29:55 prox-03 keyboard-setup[489]: Setting preliminary keymap...done.
Jun  9 09:29:55 prox-03 systemd-fsck[736]: /dev/sdc1: recovering journal
Jun  9 09:29:55 prox-03 systemd-fsck[736]: /dev/sdc1: clean, 16/61038592 files, 33908830/244137472 blocks
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "bacula" now active
Jun  9 09:29:55 prox-03 systemd-fsck[947]: /dev/mapper/bacula-backup: recovering journal
Jun  9 09:29:55 prox-03 lvm[931]: 17 logical volume(s) in volume group "fc1_sas_lun2" now active
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "fc1-ssd-lun1" now active
Jun  9 09:29:55 prox-03 systemd-fsck[947]: /dev/mapper/bacula-backup: clean, 33021/395313152 files, 2018185557/3162505216 blocks
Jun  9 09:29:55 prox-03 lvm[931]: 6 logical volume(s) in volume group "fc1_data_lun0" now active
Jun  9 09:29:55 prox-03 lvm[931]: 4 logical volume(s) in volume group "local_ssd" now active
Jun  9 09:29:55 prox-03 lvm[931]: 1 logical volume(s) in volume group "local_pve" now active
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpuset
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpu
Jun  9 09:29:55 prox-03 kernel: [  0.000000] Initializing cgroup subsys cpuacct
Jun  9 09:29:55 prox-03 systemd-fsck[1259]: /dev/mapper/local_pve-data: recovering journal
...
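The qla2xxx "Abort command issued" lines usually mean SCSI commands timed out on the FC path - no idea yet whether that is cause or symptom of the freeze. To see what HBA and LUN report, something like this (sdf is only an example device):
Code:
# FC HBA link state and speed
cat /sys/class/fc_host/host*/port_state
cat /sys/class/fc_host/host*/speed
# per-LUN SCSI command timeout (default 30s), can be raised if the storage is slow under load
cat /sys/block/sdf/device/timeout
echo 60 > /sys/block/sdf/device/timeout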
Udo
 
Hi again,
I just looked in the logfile from the first crash... to my eyes there is nothing there (except a line break)
Code:
Jun  7 11:57:23 prox-02 rrdcached[2695]: flushing old values
Jun  7 11:57:23 prox-02 rrdcached[2695]: rotating journals
Jun  7 11:57:23 prox-02 rrdcached[2695]: started new journal /var/lib/rrdcached/journal/rrd.journal.1465293443.229715
Jun  7 11:57:23 prox-02 rrdcached[2695]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1465286243.229732
Jun  7 11:57:40 prox-02 puppet-agent[1885]: Finished catalog run in 1.11 seconds
Jun  7 11:58:40 prox-02 pveproxy[23848]: worker exit
Jun  7 11:58:40 prox-02 pveproxy[29147]: worker 23848 finished
Jun  7 11:58:40 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 11:58:40 prox-02 pveproxy[29147]: worker 3430 started
Jun  7 11:58:40 prox-02 pvedaemon[30198]: <root@pam> successful auth for user 'fxxxxx@ldap'
Jun  7 11:58:50 prox-02 pmxcfs[2886]: [status] notice: received log
Jun  7 11:59:56 prox-02 pveproxy[25187]: worker exit
Jun  7 11:59:56 prox-02 pveproxy[29147]: worker 25187 finished
Jun  7 11:59:56 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 11:59:56 prox-02 pveproxy[29147]: worker 4347 started
Jun  7 12:02:45 prox-02 pveproxy[28676]: worker exit
Jun  7 12:02:45 prox-02 pveproxy[29147]: worker 28676 finished
Jun  7 12:02:45 prox-02 pveproxy[29147]: starting 1 worker(s)
Jun  7 12:02:45 prox-02 pveproxy[29147]: worker 6779 started
Jun  7 12:04:22 prox-02 pmxcfs[2886]: [dcdb] notice: data verification successful
Jun  7 12:04:31 prox-02 pmxcfs[2886]: [status] notice: received log

Jun  7 12:13:53 prox-02 systemd-modules-load[415]: Module 'fuse' is builtin
Jun  7 12:13:53 prox-02 systemd-modules-load[415]: Inserted module 'vhost_net'
Jun  7 12:13:53 prox-02 hdparm[445]: Setting parameters of disc: (none).
Jun  7 12:13:53 prox-02 keyboard-setup[446]: Setting preliminary keymap...done.
Jun  7 12:13:53 prox-02 systemd-fsck[841]: /dev/sdd1: recovering journal
Jun  7 12:13:53 prox-02 systemd-fsck[843]: /dev/sde1: recovering journal
Jun  7 12:13:53 prox-02 systemd-fsck[843]: /dev/sde1: clean, 13/106815488 files, 6754296/427245568 blocks
Jun  7 12:13:53 prox-02 systemd-fsck[841]: /dev/sdd1: clean, 48/4890624 files, 7255376/19530752 blocks
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV wlfd02Ve6d4JgWcFCx0X7lVRrkohOsSg: using /dev/sdi not /dev/sdf
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV NRyr6FtByg0hyzSaKcew9GtPmF2dP05B: using /dev/sdj not /dev/sdg
Jun  7 12:13:53 prox-02 lvm[880]: Found duplicate PV O5Z27PPlc24IgK495LaaXGagiiRO3ynE: using /dev/sdk not /dev/sdh
Jun  7 12:13:53 prox-02 lvm[880]: 8 logical volume(s) in volume group "fc1_sas_lun2" now active
Jun  7 12:13:53 prox-02 lvm[880]: 1 logical volume(s) in volume group "fc1-ssd-lun1" now active
Jun  7 12:13:53 prox-02 lvm[880]: 6 logical volume(s) in volume group "fc1_data_lun0" now active
Jun  7 12:13:53 prox-02 lvm[880]: 2 logical volume(s) in volume group "a_sas_r0" now active
Jun  7 12:13:53 prox-02 lvm[880]: 0 logical volume(s) in volume group "b_sas_r1" now active
Jun  7 12:13:53 prox-02 lvm[880]: 6 logical volume(s) in volume group "pve_local" now active
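The "Found duplicate PV" messages are the usual symptom of LVM scanning the single FC paths in addition to the multipath maps. If those LUNs are under multipath control, a filter in /etc/lvm/lvm.conf normally fixes that - a sketch (the local-disk pattern sd[a-e] is only an example, adjust it):
Code:
# /etc/lvm/lvm.conf, in the devices { } section
global_filter = [ "a|^/dev/mapper/|", "a|^/dev/sd[a-e]|", "r|.*|" ]
# afterwards: update-initramfs -u, so the filter is also active in the initrd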
Udo
 
