Kernel panic on all 4 nodes in a cluster, VMs still operational

yaboc

Renowned Member
Nov 13, 2012
Hi,

I'm on
pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve)

with an iSCSI backend for VM storage and NFS for backups. I believe the backup failed and caused all nodes to go dark; they are still pingable and accessible.
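
For context, the storage setup looks roughly like this in /etc/pve/storage.cfg (names and addresses below are placeholders, not my real ones; only the VM_BACKUP name and path match the dump path further down):

root@px1:~# cat /etc/pve/storage.cfg
iscsi: san-vmstore
        portal 10.18.66.200
        target iqn.2005-10.org.freenas.ctl:vmstore
        content none

nfs: VM_BACKUP
        server 10.18.66.201
        export /mnt/tank/backup
        path /mnt/pve/VM_BACKUP
        content backup
        maxfiles 2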

root@px1:~# pvecm status
Quorum information
------------------
Date: Tue Jan 24 12:48:16 2017
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1/6172
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.18.66.30 (local)
0x00000002 1 10.18.66.31
0x00000003 1 10.18.66.32
0x00000004 1 10.18.66.33


[ 25.520642] audit: type=1400 audit(1485273665.965:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=2172 comm="apparmor_parser"
[ 25.889976] cgroup: new mount options do not match the existing superblock, will be ignored
[ 26.240282] cgroup: new mount options do not match the existing superblock, will be ignored
[ 26.340868] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[ 26.743357] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 27.725131] device eth0 left promiscuous mode
[ 27.814784] device vmbr0 left promiscuous mode
[ 27.897404] device eth1 left promiscuous mode
[ 27.954009] device vmbr1 left promiscuous mode
[ 28.069865] device vmbr1 entered promiscuous mode
[ 28.108078] device eth1 entered promiscuous mode
[ 29.043345] device vmbr0 entered promiscuous mode
[ 29.081261] device eth0 entered promiscuous mode
[ 31.849443] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 31.938346] ip_set: protocol 6
[ 42.186634] scsi host4: iSCSI Initiator over TCP/IP
[ 42.448386] scsi 4:0:0:0: Direct-Access FreeBSD iSCSI Disk 0123 PQ: 0 ANSI: 6
[ 42.449226] sd 4:0:0:0: Attached scsi generic sg2 type 0
[ 42.449972] sd 4:0:0:0: [sdc] 2306867200 512-byte logical blocks: (1.18 TB/1.07 TiB)
[ 42.449974] sd 4:0:0:0: [sdc] 16384-byte physical blocks
[ 42.455252] sd 4:0:0:0: [sdc] Write Protect is off
[ 42.455255] sd 4:0:0:0: [sdc] Mode Sense: 73 00 10 08
[ 42.468025] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 42.499551] sd 4:0:0:0: [sdc] Attached SCSI disk
[ 240.236101] INFO: task spiceproxy:2951 blocked for more than 120 seconds.
[ 240.236313] Tainted: P C O 4.4.35-1-pve #1
[ 240.243884] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.251465] spiceproxy D ffff8805dc553df8 0 2951 1 0x00000004
[ 240.251470] ffff8805dc553df8 ffff880603234840 ffffffff81e12580 ffff8805f975e600
[ 240.251473] ffff8805dc554000 ffff8805e9c109ac ffff8805f975e600 00000000ffffffff
[ 240.251475] ffff8805e9c109b0 ffff8805dc553e10 ffffffff81858155 ffff8805e9c109a8
[ 240.251478] Call Trace:
[ 240.251487] [<ffffffff81858155>] schedule+0x35/0x80
[ 240.251490] [<ffffffff8185840e>] schedule_preempt_disabled+0xe/0x10
[ 240.251493] [<ffffffff8185a109>] __mutex_lock_slowpath+0xb9/0x130
[ 240.251495] [<ffffffff8185a19f>] mutex_lock+0x1f/0x30
[ 240.251499] [<ffffffff8121f35a>] filename_create+0x7a/0x160
[ 240.251502] [<ffffffff812202f3>] SyS_mkdir+0x53/0x100
[ 240.251505] [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75
[ 360.248102] INFO: task spiceproxy:2951 blocked for more than 120 seconds.
[ 360.255671] Tainted: P C O 4.4.35-1-pve #1
[ 360.263114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.270586] spiceproxy D ffff8805dc553df8 0 2951 1 0x00000004
[ 360.270592] ffff8805dc553df8 ffff880603234840 ffffffff81e12580 ffff8805f975e600
[ 360.270595] ffff8805dc554000 ffff8805e9c109ac ffff8805f975e600 00000000ffffffff
[ 360.270597] ffff8805e9c109b0 ffff8805dc553e10 ffffffff81858155 ffff8805e9c109a8
[ 360.270600] Call Trace:

It goes into a kernel panic as soon as the node is rebooted. After I rebooted one node (node4), the VMs on it won't start.

The vzdump log shows that all VMs were backed up successfully:

Jan 20 20:21:37 INFO: status: 98% (33674428416/34359738368), sparse 23% (8209256448), duration 1293, 22/22 MB/s
Jan 20 20:21:49 INFO: status: 99% (34028191744/34359738368), sparse 24% (8343355392), duration 1305, 29/18 MB/s
Jan 20 20:22:03 INFO: status: 100% (34359738368/34359738368), sparse 24% (8352653312), duration 1319, 23/23 MB/s
Jan 20 20:22:03 INFO: transferred 34359 MB in 1319 seconds (26 MB/s)
Jan 20 20:22:05 INFO: archive file size: 14.09GB
Jan 20 20:22:06 INFO: delete old backup '/mnt/pve/VM_BACKUP/dump/vzdump-qemu-104-2017_01_13-20_00_02.vma.lzo'
Jan 20 20:22:10 INFO: Finished Backup of VM 104 (00:22:08)

VM commands just freeze in the SSH session. I tried to restore one from the GUI and it just won't do it. Creating a new test VM also fails.
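
In case it helps anyone hitting the same thing, this is roughly what I'd check here (a sketch, not verified on 4.4): which processes are stuck in uninterruptible sleep and whether the NFS backup mount still responds.

# list processes stuck in D state (usually waiting on a dead NFS/iSCSI mount)
ps axo stat,pid,ppid,comm | awk '$1 ~ /^D/'

# check whether the backup mount still answers; a hung NFS server makes this block
timeout 5 ls /mnt/pve/VM_BACKUP || echo "backup mount not responding"

# dump all blocked tasks to the kernel log (sysrq must be enabled)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 60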

root@px1:~# /etc/init.d/pve-cluster restart
[....] Restarting pve-cluster (via systemctl): pve-cluster.serviceJob for pve-cluster.service failed. See 'systemctl status pve-cluster.service' and 'journalctl -xn' for details.
failed!
root@px1:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: signal) since Tue 2017-01-24 13:02:35 EST; 10s ago
Process: 21534 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 21537 (code=killed, signal=KILL)

Jan 24 13:00:54 px1 pmxcfs[21537]: [dcdb] notice: starting data syncronisation
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: members: 1/21537, 2/2466, 3/2426, 4/2605
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: starting data syncronisation
Jan 24 13:00:54 px1 pmxcfs[21537]: [dcdb] notice: received sync request (epoch 1/21537/00000001)
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: received sync request (epoch 1/21537/00000001)
Jan 24 13:02:24 px1 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Jan 24 13:02:34 px1 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Jan 24 13:02:35 px1 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Jan 24 13:02:35 px1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jan 24 13:02:35 px1 systemd[1]: Unit pve-cluster.service entered failed state.
root@px1:~# journalctl -xn
-- Logs begin at Wed 2016-12-21 03:05:38 EST, end at Tue 2017-01-24 13:03:55 EST. --
Jan 24 13:03:55 px1 pve-ha-crm[3031]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-lrm[3044]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-crm[3031]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-lrm[3044]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
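
My guess (not verified) is that the killed pmxcfs left its FUSE mount on /etc/pve behind, which is why the next start hangs in the start-post check until systemd kills it again. Something along these lines might be worth trying before a reboot:

# see if the old pmxcfs fuse mount is still there
mount | grep /etc/pve

# stop the unit, clear the stale mount, then start fresh
systemctl stop pve-cluster
killall -9 pmxcfs
umount -l /etc/pve
systemctl start pve-cluster
systemctl status pve-cluster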


 
Somehow the backup broke the NFS connection and I had to reboot all nodes to bring everything back up. Upgraded to the latest PVE and kernel.
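
For the archives: next time I'd try taking the dead NFS storage out of the picture before rebooting every node, something like the below (storage name taken from the backup path above; I haven't double-checked that pvesm on 4.4 accepts the disable flag):

# stop pvestatd and the GUI from polling the hung NFS storage
pvesm set VM_BACKUP --disable 1

# force a lazy unmount of the dead share
umount -f -l /mnt/pve/VM_BACKUP

# re-enable once the NFS server is reachable again
pvesm set VM_BACKUP --disable 0
pvesm status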
 
