Kernel panic on all 4 nodes in a cluster, VMs still operational

yaboc

Renowned Member
Nov 13, 2012
Hi,

I'm on
pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve)

with an iSCSI backend for VM storage and NFS for backups. I believe the backup failed and caused all nodes to go dark; they are still pingable and accessible.
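
For context, the storage setup looks roughly like this in /etc/pve/storage.cfg (names and addresses below are placeholders, not my real ones; only the VM_BACKUP name and path match the dump path further down):

root@px1:~# cat /etc/pve/storage.cfg
iscsi: san-vmstore
        portal 10.18.66.200
        target iqn.2005-10.org.freenas.ctl:vmstore
        content none

nfs: VM_BACKUP
        server 10.18.66.201
        export /mnt/tank/backup
        path /mnt/pve/VM_BACKUP
        content backup
        maxfiles 2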

root@px1:~# pvecm status
Quorum information
------------------
Date: Tue Jan 24 12:48:16 2017
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1/6172
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.18.66.30 (local)
0x00000002 1 10.18.66.31
0x00000003 1 10.18.66.32
0x00000004 1 10.18.66.33


[ 25.520642] audit: type=1400 audit(1485273665.965:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=2172 comm="apparmor_parser"
[ 25.889976] cgroup: new mount options do not match the existing superblock, will be ignored
[ 26.240282] cgroup: new mount options do not match the existing superblock, will be ignored
[ 26.340868] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[ 26.743357] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 27.725131] device eth0 left promiscuous mode
[ 27.814784] device vmbr0 left promiscuous mode
[ 27.897404] device eth1 left promiscuous mode
[ 27.954009] device vmbr1 left promiscuous mode
[ 28.069865] device vmbr1 entered promiscuous mode
[ 28.108078] device eth1 entered promiscuous mode
[ 29.043345] device vmbr0 entered promiscuous mode
[ 29.081261] device eth0 entered promiscuous mode
[ 31.849443] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 31.938346] ip_set: protocol 6
[ 42.186634] scsi host4: iSCSI Initiator over TCP/IP
[ 42.448386] scsi 4:0:0:0: Direct-Access FreeBSD iSCSI Disk 0123 PQ: 0 ANSI: 6
[ 42.449226] sd 4:0:0:0: Attached scsi generic sg2 type 0
[ 42.449972] sd 4:0:0:0: [sdc] 2306867200 512-byte logical blocks: (1.18 TB/1.07 TiB)
[ 42.449974] sd 4:0:0:0: [sdc] 16384-byte physical blocks
[ 42.455252] sd 4:0:0:0: [sdc] Write Protect is off
[ 42.455255] sd 4:0:0:0: [sdc] Mode Sense: 73 00 10 08
[ 42.468025] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 42.499551] sd 4:0:0:0: [sdc] Attached SCSI disk
[ 240.236101] INFO: task spiceproxy:2951 blocked for more than 120 seconds.
[ 240.236313] Tainted: P C O 4.4.35-1-pve #1
[ 240.243884] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.251465] spiceproxy D ffff8805dc553df8 0 2951 1 0x00000004
[ 240.251470] ffff8805dc553df8 ffff880603234840 ffffffff81e12580 ffff8805f975e600
[ 240.251473] ffff8805dc554000 ffff8805e9c109ac ffff8805f975e600 00000000ffffffff
[ 240.251475] ffff8805e9c109b0 ffff8805dc553e10 ffffffff81858155 ffff8805e9c109a8
[ 240.251478] Call Trace:
[ 240.251487] [<ffffffff81858155>] schedule+0x35/0x80
[ 240.251490] [<ffffffff8185840e>] schedule_preempt_disabled+0xe/0x10
[ 240.251493] [<ffffffff8185a109>] __mutex_lock_slowpath+0xb9/0x130
[ 240.251495] [<ffffffff8185a19f>] mutex_lock+0x1f/0x30
[ 240.251499] [<ffffffff8121f35a>] filename_create+0x7a/0x160
[ 240.251502] [<ffffffff812202f3>] SyS_mkdir+0x53/0x100
[ 240.251505] [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75
[ 360.248102] INFO: task spiceproxy:2951 blocked for more than 120 seconds.
[ 360.255671] Tainted: P C O 4.4.35-1-pve #1
[ 360.263114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.270586] spiceproxy D ffff8805dc553df8 0 2951 1 0x00000004
[ 360.270592] ffff8805dc553df8 ffff880603234840 ffffffff81e12580 ffff8805f975e600
[ 360.270595] ffff8805dc554000 ffff8805e9c109ac ffff8805f975e600 00000000ffffffff
[ 360.270597] ffff8805e9c109b0 ffff8805dc553e10 ffffffff81858155 ffff8805e9c109a8
[ 360.270600] Call Trace:

It goes into a kernel panic as soon as the node is rebooted. After I rebooted one node (node4), the VMs on it won't start.

The vzdump log shows that all VMs were backed up successfully:

Jan 20 20:21:37 INFO: status: 98% (33674428416/34359738368), sparse 23% (8209256448), duration 1293, 22/22 MB/s
Jan 20 20:21:49 INFO: status: 99% (34028191744/34359738368), sparse 24% (8343355392), duration 1305, 29/18 MB/s
Jan 20 20:22:03 INFO: status: 100% (34359738368/34359738368), sparse 24% (8352653312), duration 1319, 23/23 MB/s
Jan 20 20:22:03 INFO: transferred 34359 MB in 1319 seconds (26 MB/s)
Jan 20 20:22:05 INFO: archive file size: 14.09GB
Jan 20 20:22:06 INFO: delete old backup '/mnt/pve/VM_BACKUP/dump/vzdump-qemu-104-2017_01_13-20_00_02.vma.lzo'
Jan 20 20:22:10 INFO: Finished Backup of VM 104 (00:22:08)

VM commands just freeze in the SSH session. I tried to restore one from the GUI and it just won't do it. Creating a new test VM also fails.
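
In case it helps anyone hitting the same thing, this is roughly what I'd check here (a sketch, not verified on 4.4): which processes are stuck in uninterruptible sleep and whether the NFS backup mount still responds.

# list processes stuck in D state (usually waiting on a dead NFS/iSCSI mount)
ps axo stat,pid,ppid,comm | awk '$1 ~ /^D/'

# check whether the backup mount still answers; a hung NFS server makes this block
timeout 5 ls /mnt/pve/VM_BACKUP || echo "backup mount not responding"

# dump all blocked tasks to the kernel log (sysrq must be enabled)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 60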

root@px1:~# /etc/init.d/pve-cluster restart
[....] Restarting pve-cluster (via systemctl): pve-cluster.serviceJob for pve-cluster.service failed. See 'systemctl status pve-cluster.service' and 'journalctl -xn' for details.
failed!
root@px1:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: signal) since Tue 2017-01-24 13:02:35 EST; 10s ago
Process: 21534 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 21537 (code=killed, signal=KILL)

Jan 24 13:00:54 px1 pmxcfs[21537]: [dcdb] notice: starting data syncronisation
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: members: 1/21537, 2/2466, 3/2426, 4/2605
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: starting data syncronisation
Jan 24 13:00:54 px1 pmxcfs[21537]: [dcdb] notice: received sync request (epoch 1/21537/00000001)
Jan 24 13:00:54 px1 pmxcfs[21537]: [status] notice: received sync request (epoch 1/21537/00000001)
Jan 24 13:02:24 px1 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Jan 24 13:02:34 px1 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Jan 24 13:02:35 px1 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Jan 24 13:02:35 px1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jan 24 13:02:35 px1 systemd[1]: Unit pve-cluster.service entered failed state.
root@px1:~# journalctl -xn
-- Logs begin at Wed 2016-12-21 03:05:38 EST, end at Tue 2017-01-24 13:03:55 EST. --
Jan 24 13:03:55 px1 pve-ha-crm[3031]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-lrm[3044]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-crm[3031]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pve-ha-lrm[3044]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
Jan 24 13:03:55 px1 pvestatd[3015]: ipcc_send_rec failed: Connection refused
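
My guess (not verified) is that the killed pmxcfs left its FUSE mount on /etc/pve behind, which is why the next start hangs in the start-post check until systemd kills it again. Something along these lines might be worth trying before a reboot:

# see if the old pmxcfs fuse mount is still there
mount | grep /etc/pve

# stop the unit, clear the stale mount, then start fresh
systemctl stop pve-cluster
killall -9 pmxcfs
umount -l /etc/pve
systemctl start pve-cluster
systemctl status pve-cluster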


 
Somehow the backup broke the NFS connection and I had to reboot all nodes to bring everything back up. Upgraded to the latest PVE and kernel.
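
For the archives: next time I'd try taking the dead NFS storage out of the picture before rebooting every node, something like the below (storage name taken from the backup path above; I haven't double-checked that pvesm on 4.4 accepts the disable flag):

# stop pvestatd and the GUI from polling the hung NFS storage
pvesm set VM_BACKUP --disable 1

# force a lazy unmount of the dead share
umount -f -l /mnt/pve/VM_BACKUP

# re-enable once the NFS server is reachable again
pvesm set VM_BACKUP --disable 0
pvesm status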
 
