AAC RAID problem with Kernel 4.13

xkill

Active Member
Nov 14, 2017
Hi,

I rebooted my server, which has an aacraid controller, and after the reboot with kernel 4.13.4 I noticed several I/O performance problems. I investigated and found the following:

Code:
[  352.348786] irq 16: nobody cared (try booting with the "irqpoll" option)
[  352.348830] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.13.4-1-pve #1
[  352.348860] Hardware name: System manufacturer System Product Name/P8H77-M PRO, BIOS 9002 05/30/2014
[  352.348901] Call Trace:
[  352.348926]  <IRQ>
[  352.348954]  dump_stack+0x63/0x8b
[  352.348981]  __report_bad_irq+0x35/0xc0
[  352.349008]  note_interrupt+0x234/0x280
[  352.349035]  handle_irq_event_percpu+0x54/0x80
[  352.349063]  handle_irq_event+0x3b/0x60
[  352.349090]  handle_fasteoi_irq+0x79/0x120
[  352.349117]  handle_irq+0x1a/0x30
[  352.349144]  do_IRQ+0x46/0xd0
[  352.349169]  common_interrupt+0x89/0x89
[  352.349197] RIP: 0010:cpuidle_enter_state+0x126/0x2c0
[  352.349225] RSP: 0018:ffffc3348191fe60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff3d
[  352.349264] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000001f
[  352.349295] RDX: 00000052099fc169 RSI: ffffffec101964a0 RDI: 0000000000000000
[  352.349325] RBP: ffffc3348191fe98 R08: 00000000ffffffff R09: 0000000000000008
[  352.349355] R10: ffffc3348191fe30 R11: 0000000000000000 R12: ffff9f765fb24f00
[  352.349385] R13: ffffffffbf1785f8 R14: 00000052099fc169 R15: ffffffffbf1785e0
[  352.349415]  </IRQ>
[  352.349440]  cpuidle_enter+0x17/0x20
[  352.349467]  call_cpuidle+0x23/0x40
[  352.349493]  do_idle+0x199/0x200
[  352.349519]  cpu_startup_entry+0x73/0x80 
[  352.349547]  start_secondary+0x156/0x190 
[  352.349575]  secondary_startup_64+0x9f/0x9f
[  352.349602] handlers:
[  352.349629] [<ffffffffc0333e50>] aac_rx_intr_message [aacraid]
[  352.349657] Disabling IRQ #16

Code:
[ 2864.566916] aacraid 0000:01:00.0: invalid short VPD tag 00 at offset 1
[ 2864.567397] r8169 0000:03:00.0: invalid large VPD tag 7f at offset 0


I also tried to boot using the "noirq" option, but I had the same problem.
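For reference, boot options like the "irqpoll" one suggested in the trace are normally added to the kernel command line via GRUB. A minimal sketch, assuming a default Debian/Proxmox GRUB setup (keep whatever options are already set):
Code:
# /etc/default/grub -- append the option to the existing command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet irqpoll"

# regenerate the GRUB configuration and reboot
update-grub
reboot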

Not sure if it's a bug in the mainline Linux kernel or in the Proxmox kernel.

Can I compile my own kernel based on the Proxmox kernel ".config"? Do I have to patch anything? (I noticed that the git repository contains a few patches, but I'm not sure which ones are required. I'm not using ZFS, Corosync, or Intel network cards.)
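A rough sketch of what building against the running kernel's config could look like (this is not the official PVE build process and it ignores the PVE-specific patches; the package list and source path are just examples):
Code:
# build dependencies (Debian-based, illustrative list)
apt-get install build-essential libssl-dev libelf-dev flex bison bc

# reuse the config of the currently running kernel
cd linux-4.13.4                       # unpacked kernel source tree (example path)
cp /boot/config-$(uname -r) .config
make olddefconfig                     # accept defaults for any new options

# produce .deb packages that can be installed with dpkg -i
make -j$(nproc) bindeb-pkg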

Attached a file with my hardware details.
 


you could try booting a vanilla 4.14 kernel. if that one doesn't have the issue, you could try each of the 4.14 RCs in backwards order and check when the issue was fixed. if 4.14 is also affected, you can go backwards in time trying the first and last releases of each kernel series (4.13, 4.12, ...) until you find one which is not affected. http://kernel.ubuntu.com/~kernel-ppa/mainline/ contains builds of most mainline kernels as .deb files. once we have a range of "used to work" - "does not work anymore" or "does not work" - "works again", we can hopefully narrow down the issue to a commit and fix it in pve-kernel.
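installing one of those mainline builds is usually just a matter of downloading the matching .deb files and installing them with dpkg. a rough sketch (the file names below are placeholders, use the ones listed for the version you pick):
Code:
# download the linux-image and matching linux-headers .deb files for the
# chosen version from the mainline page above, then:
dpkg -i linux-image-4.14.0-*_amd64.deb linux-headers-4.14.0-*.deb
reboot    # pick the new kernel in the GRUB menu if it is not the default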
 
Hi,

It seems that the problem does not appear with the Ubuntu kernel (4.14.0-041400-generic).

But the containers do not start.

First I noticed the following problem:
Code:
 pve-firewall[1808]: status update error: unable to open file '/proc/sys/net/bridge/bridge-nf-call-iptables' - No such file or directory

So I loaded the kernel module:
Code:
root@proxmox # modprobe br_netfilter
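To make the module load survive a reboot and to confirm that the sysctl pve-firewall was looking for now exists, a small sketch (using the standard systemd modules-load.d convention; the file name is just an example):
Code:
# load the module now and on every boot
modprobe br_netfilter
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf

# verify the sysctl pve-firewall complained about is present again
cat /proc/sys/net/bridge/bridge-nf-call-iptables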

But the LXC containers still do not start; I got the following errors:
Code:
root@proxmox # lxc-start -F --name 100 --logfile /tmp/lxce-start-100.log --logpriority=DEBUG
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100"
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100-1" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100-1"
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100-2" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100-2"
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100-3" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100-3"
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100-4" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100-4"
lxc-start: 100: cgroups/cgfsng.c: create_path_for_hierarchy: 1335 Path "/sys/fs/cgroup/systemd//lxc/100-5" already existed.
lxc-start: 100: cgroups/cgfsng.c: cgfsng_create: 1431 Failed to create "/sys/fs/cgroup/systemd//lxc/100-5"
lxc-start: 100: cgroups/cgfsng.c: cgfsng_setup_limits: 2120 Permission denied - Error setting memory.memsw.limit_in_bytes to 1073741824 for 100
lxc-start: 100: start.c: lxc_spawn: 1274 Failed to setup cgroup limits for container "100".
lxc-start: 100: start.c: __lxc_start: 1469 Failed to spawn container "100".
lxc-start: 100: tools/lxc_start.c: main: 368 The container failed to start.
lxc-start: 100: tools/lxc_start.c: main: 372 Additional information can be obtained by setting the --logfile and --logpriority options.

Attached is the DEBUG log from the previous command.

NOTE: KVM virtual machines work fine.
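For what it's worth, the "Permission denied ... memory.memsw.limit_in_bytes" error above is often seen when the running kernel has swap accounting disabled (the Ubuntu mainline builds leave it off unless swapaccount=1 is passed on the kernel command line). A hedged way to check and, if needed, enable it:
Code:
# if this file does not exist, swap accounting is disabled in the running kernel
ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes

# enable it by appending swapaccount=1 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config and reboot
update-grub
reboot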
 


4.14 changed some apparmor and cgroup stuff, so that is probably unrelated. if you are very motivated, you could now check the 4.14 RCs to see which one fixed it, and then go back in time to see when the problem first appeared (first the major versions like 4.13, 4.12, then once you find one that is affected, you could check its RCs as well).
 
My server is in production; when I have some time I'll install Proxmox on another computer (without the aacraid adapter) and check it there. I'll update this post as soon as I have Proxmox working with 4.14 (I'll probably have to recompile the Ubuntu kernel and add the missing modules).
Can you let me know which Proxmox kernel patches are required for Proxmox to work properly? For example, 0002-bridge-keep-MAC-of-first-assigned-port.patch (not sure whether that particular patch is required; it's just an example).
 

I don't think you will be able to get 4.14 working on your own because of the AppArmor issues. maybe you can give the PVE 4.10 kernel linked in https://forum.proxmox.com/threads/proxmox-ve-5-1-released.37650/page-2#post-186927 a try?