4.15 based test kernel for PVE 5.x available

After a week in a production environment with 4.15.3-1 ... again one node with a question mark.
I will try 4.15.10-1-pve.
 
I have 4.15.3 running without any issues.

What is the error related to? SSL, KSM, or just the node becoming grey and one LXC guest not starting?

I faced a few node restart issues due to KSM not starting.

I had to manually set the memory-usage percentage at which KSM starts to 75% to avoid the issue, and then did a systemctl restart ksmtuned.
 
I don't have any AMD-based nodes, so I didn't test it.

But I can confirm that for LXC there are still many bugs causing node restarts, which is very annoying.

If you are referring to KSM not merging pages fast enough for your setup - that is not a bug. Overcommitting resources is always a dangerous game to play.
 
No, that is not the issue.

Suppose I have already started 3 guests and the node is at 75% memory usage; it will not start KSM sharing.

And when I start the 4th guest, the node crashes and restarts.

I reproduced the same error multiple times. Every time the node crashed.

Then I changed the KSM threshold to 50% (KSM_THRES_COEF=50), and now KSM starts when I have three guests running.

And I can start the 4th guest without any crash.
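
For reference, this is roughly what the change looks like. I'm assuming the stock /etc/ksmtuned.conf location here, since the config file path isn't mentioned above; adjust it if your ksmtuned config lives elsewhere.

Code:
# /etc/ksmtuned.conf
# KSM_THRES_COEF is the free-memory percentage below which ksmtuned activates KSM;
# raising it above the default makes KSM start merging earlier.
KSM_THRES_COEF=50

# apply the change
systemctl restart ksmtuned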
 
pve-kernel-4.15.10-1-pve also has the above KSM sharing issue.

If you have plenty of memory, you will not see it.

I have 25+ nodes and I don't have plenty of memory, so I see it often.

But after adjusting the KSM threshold percentage, I didn't face any issues.
 
Like I said - this is not a bug. When you overcommit resources, you need to plan carefully, otherwise you might run out of resources. KSM is always asynchronous. Unless you have some details to share which you haven't included so far that actually point to a bug, please stop posting this "issue" in this thread. Thanks.
 
With the default settings, the node crashes when I start the 4th guest.

With the changed settings, the node does not crash when I start the 4th guest, since KSM starts early enough.
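
For anyone who wants to verify that KSM has actually kicked in before starting the next guest, the kernel's standard KSM counters under /sys/kernel/mm/ksm can be checked directly; this is plain sysfs, nothing PVE-specific.

Code:
cat /sys/kernel/mm/ksm/run            # 1 while KSM merging is enabled
cat /sys/kernel/mm/ksm/pages_sharing  # number of currently shared pages; grows once deduplication is working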

I think these are more LXC-related issues than a kernel bug.

With KVM I didn't face any issues.
 
The system crashed and restarted at 20:02:00.

(I have 25 live nodes, and 1 or 2 of them crash like this every day.)

Log file:

Mar 29 19:39:42 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:39:56 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:40:04 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:40:35 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:40:57 Q172 pvedaemon[4385]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:41:55 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:42:13 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:42:56 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:44:40 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:45:03 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:45:03 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:49:19 Q172 pvedaemon[9167]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:56:44 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 19:59:09 Q172 pvedaemon[2841]: <root@pam> successful auth for user 'root@pam'
Mar 29 20:02:19 Q172 kernel: [ 0.000000] Linux version 4.15.3-1-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.3-1 (Fri, 9 Mar 2018 14:45:34 +0100) ()
Mar 29 20:02:19 Q172 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.15.3-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 29 20:02:19 Q172 kernel: [ 0.000000] KERNEL supported cpus:
Mar 29 20:02:19 Q172 kernel: [ 0.000000] Intel GenuineIntel
Mar 29 20:02:19 Q172 kernel: [ 0.000000] AMD AuthenticAMD
Mar 29 20:02:19 Q172 kernel: [ 0.000000] Centaur CentaurHauls
Mar 29 20:02:19 Q172 kernel: [ 0.000000] x86/fpu: x87 FPU will use FXSAVE
Mar 29 20:02:19 Q172 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009e7ff] usable
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009e800-0x000000000009ffff] reserved
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bf72ffff] usable
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000bf730000-0x00000000bf73dfff] ACPI data
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000bf73e000-0x00000000bf79ffff] ACPI NVS
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000bf7a0000-0x00000000bf7affff] reserved
Mar 29 20:02:19 Q172 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000bf7bc000-0x00000000bfffffff] reserved
 
@efeu: I have a customized mini home server with some VMs and Containers. It's a

Threadripper 1900X
Asrock X399 Taichi
2 x 16 GB DDR4 2400 Kingston ECC unbuffered RAM
nVidia G710 (host)
nVidia GTX1080 (Win10 guest)
250 GB Samsung EVO SSD (host only)
512 GB Samsung EVO SSD ZFS pool (Win10 guest + 5 VMs)
3 x 3 TB Samsung Eco green HDD in RAIDZ1 (Fileserver, Container, Templates)

Running latest proxmox:
root@pve:~# pveversion --verbose
proxmox-ve: 5.1-42 (running kernel: 4.15.10-1-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13: 5.1-43
pve-kernel-4.15: 5.1-2
pve-kernel-4.15.10-1-pve: 4.15.10-2
pve-kernel-4.13.16-1-pve: 4.13.16-43
pve-kernel-4.13.13-6-pve: 4.13.13-42
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

Still in need of the java fix for PCI-e passthrough to my Win10 gaming system, as well as a fix for the GPU sleep problem, where only a reboot of the host system restores the GPU to the guest OS. Regardless of that, everything seems to be stable and very fast.

1 x Win10 VM KVM with PCI-e passthrough for gaming (4 cores, 12 GB RAM)
1 x VM running ubuntu 16.10 with Squeezeboxserver (2 cores, 512 MB RAM)
1 x VM running debian 9 with nginx reverse proxy (4 cores, 1 GB RAM)
1 x VM running debian 9 with nextcloud (4 cores, 1 GB RAM)
1 x VM running debian 9 with mailserver (4 cores, 4 GB RAM)
1 x VM running debian 9 with monitoring (1 core, 1 GB RAM)
1 x LXC running ubuntu 16.10 with motioneye (2 cores, 1 GB RAM)
1 x LXC running ubuntu 16.10 with ampache music server (1 core, 512 MB RAM)

SMB and NFS via ZFS.
 

Attachment: proxmox.jpeg
So even with 4.15.10-1-pve, CPU type 'host' is not working for Zen. Windows starts booting, but after a while the VM eats 800-1400% CPU in top and nothing more happens. I also noticed that you can no longer pass through the CPU-internal USB controller to a VM, which was working absolutely fine with 4.13....

I do not see any Ubuntu work on this issue, so maybe the Proxmox team could find out which changes are causing these problems and revert them for the Proxmox kernel. An AMD-compatible kernel should be something very important for a virtualization distribution, don't you agree?
 
Hi,
just tried kernel 4.15 on a Dell R620 with a Perc 710 Mini RAID volume (LVM).

4.15.10 boots fine. 4.15.15 from pvetest gets stuck after:
Code:
[   1.104090] megaraid_sas 0000:03:00.0: Init cmd return status SUCCESS for SCSI host 0
After quite a while (minutes) one more line appears:
Code:
Reading all physical volumes. This may take a while...
Then, three times (at 363s, 605s and 846s): INFO: task lvm:375 blocked for more than 120 seconds. (if I press the on/off switch).

Udo
 
I am running identical hardware to your Threadripper build above and, funny enough, a similar VM config with Hyper-V at the moment, with the host partition being my gaming VM. I have been waiting for the same PCI-e java fix before making the move to this hypervisor/VM config; has this fix dropped by chance? If so, what has your experience been?

Also a question for the dev team: once the 4.15 kernel is labeled stable, how easy will it be to switch an existing install to the new branch?
 
