[SOLVED] cluster with different versions of kernel

R0bin

Member
Dec 6, 2019
Montpellier
Hi!
I have a 3-node PVE cluster with Ceph, and everything works fine. Once or twice a month I run a dist-upgrade on all my nodes, one node after another. The hardware is the same on each bare-metal server.
Recently, my Zabbix monitoring alerted me that /boot was more than 80% full on each node after an upgrade.
I ran things like apt-get autoremove and dpkg --purge to remove old kernels, but when checking I saw that the running kernel is not the same on every node (the nodes are named occ-host-000{1..3}):
Code:
Linux occ-host-0001 5.13.19-4-pve #1 SMP PVE 5.13.19-9 (Mon, 07 Feb 2022 11:01:14 +0100) x86_64 GNU/Linux
Linux occ-host-0002 5.13.19-4-pve #1 SMP PVE 5.13.19-9 (Mon, 07 Feb 2022 11:01:14 +0100) x86_64 GNU/Linux
Linux occ-host-0003 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200) x86_64 GNU/Linux
0003 seems to be behind (I don't know why...)

Code:
root@occ-host-0001:~# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.13.19-4-pve)
root@occ-host-0002:~# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.13.19-4-pve)
root@occ-host-0003:~# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.35-1-pve)
The PVE version is the same on all nodes.

Code:
root@occ-host-0001:~# df -h |grep "/boot"
/dev/sda2                                                        488M  271M  182M  60% /boot
root@occ-host-0002:~# df -h |grep "/boot"
/dev/sda2                                                        488M  394M   59M  88% /boot
root@occ-host-0003:~# df -h |grep "/boot"
/dev/sda2                                                        488M  332M  121M  74% /boot
The used space on /boot is different on each node.

Code:
root@occ-host-0001:~# dpkg --list |grep pve-kernel
ii  pve-firmware                         3.4-2                          all          Binary firmware code for the pve-kernel
ii  pve-kernel-5.13                      7.1-9                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.13.19-4-pve             5.13.19-9                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.13.19-6-pve             5.13.19-15                     amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15                      7.2-6                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.15.35-1-pve             5.15.35-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.39-1-pve             5.15.39-1                      amd64        Proxmox Kernel Image
ii  pve-kernel-helper                    7.2-6                          all          Function for various kernel maintenance tasks.

root@occ-host-0002:~# dpkg --list |grep pve-kernel
ii  pve-firmware                         3.4-2                          all          Binary firmware code for the pve-kernel
ii  pve-kernel-5.13                      7.1-9                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.13.19-2-pve             5.13.19-4                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.13.19-4-pve             5.13.19-9                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.13.19-6-pve             5.13.19-15                     amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15                      7.2-6                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.15.35-1-pve             5.15.35-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.39-1-pve             5.15.39-1                      amd64        Proxmox Kernel Image
ii  pve-kernel-5.4                       6.4-7                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.4.143-1-pve             5.4.143-1                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    7.2-6                          all          Function for various kernel maintenance tasks.

root@occ-host-0003:~# dpkg --list |grep pve-kernel
ii  pve-firmware                         3.4-2                          all          Binary firmware code for the pve-kernel
ii  pve-kernel-5.13                      7.1-9                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.13.19-2-pve             5.13.19-4                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.13.19-6-pve             5.13.19-15                     amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15                      7.2-6                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.15.35-1-pve             5.15.35-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.39-1-pve             5.15.39-1                      amd64        Proxmox Kernel Image
ii  pve-kernel-5.4                       6.4-7                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.4.143-1-pve             5.4.143-1                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    7.2-6                          all          Function for various kernel maintenance tasks.
The installed kernels are different on each node!
What's wrong? I suspect 0003 is using the last kernel from before the upgrade to PVE 7 (but I'm not sure). Is there a GRUB issue? How can I check that?
How can I free space on /boot safely?

Thank you for helping me. For now I'm scared to upgrade the kernels on occ-host-0002 again (too little space on /boot), and scared to reboot occ-host-0003 (I don't understand the age of its kernel).
 
First, having different kernels is not a problem per se, but you should keep them at least in the same series (e.g. 5.15).
You normally only need one fallback kernel (ideally the one you were running before the update), so you can remove the older kernels from the older series (5.13 and 5.4), which will free a lot of space.
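To make the cleanup less error-prone, here is a minimal sketch that only lists candidates and removes nothing. It reads `dpkg --list` output and prints the versioned kernel image packages that are neither the running kernel nor part of the series you want to keep. The function name `purge_candidates` and the "keep one whole series" policy are my own assumptions for illustration, not something shipped with PVE.

```shell
#!/bin/sh
# purge_candidates: list kernel image packages that look removable.
# Reads `dpkg --list` output on stdin.
#   $1 = running kernel release (what `uname -r` prints)
#   $2 = kernel series to keep, e.g. 5.15
# Hypothetical helper: it only prints package names, it never removes anything.
purge_candidates() {
    awk -v running="$1" -v keep="$2." '
        # installed ("ii") versioned images, e.g. pve-kernel-5.13.19-6-pve
        $1 == "ii" && $2 ~ /^pve-kernel-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+-pve$/ {
            release = substr($2, length("pve-kernel-") + 1)
            if (release == running) next          # never list the running kernel
            if (index(release, keep) == 1) next   # skip the series we keep
            print $2
        }'
}
```

On a node you would run something like `dpkg --list | purge_candidates "$(uname -r)" 5.15`, review the output, and only then pass the names to `apt-get purge`. Note that a node still running a 5.13 kernel (like 0001 and 0002 above) keeps that image until it has been rebooted into 5.15, and apt will probably also offer to remove the matching series meta-package (e.g. pve-kernel-5.13).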
 
Thanks for the reply.
In the last code block of my previous post, we can see that the latest kernels are installed but not running (e.g. on host 0001, pve-kernel-5.15.39-1-pve is installed but 5.13.19-4-pve is running).
Is it possible to load the newest kernel (and remove the older ones) without a reboot?
 
Is it possible to load the newest kernel (and remove the older ones) without a reboot?
Yes and no. There are techniques to load a newer kernel without rebooting, and up to PVE 6 KernelCare could be used with PVE; now that seemingly is no longer the case (or at least I can't find information about it). I personally have not tried it; I always reboot. A reboot is also a good test of whether everything comes back up correctly, so you can have more trust in your machine.
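If you just want to know whether a node is still waiting for its reboot, a rough check is to compare the running release with the highest release installed in /boot. This is only a sketch: the helper names are made up for this post, it relies on GNU `sort -V`, and "newest installed" is not necessarily the entry the bootloader picks by default.

```shell
#!/bin/sh
# newest_release: read kernel release strings on stdin (one per line)
# and print the highest one, using GNU sort's version ordering.
newest_release() {
    sort -V | tail -n 1
}

# needs_reboot: exit status 0 when the running release ($1)
# differs from the newest installed release ($2).
needs_reboot() {
    [ "$1" != "$2" ]
}
```

On a node this could be used as `newest=$(ls /boot/vmlinuz-* | sed 's|^/boot/vmlinuz-||' | newest_release)` followed by `needs_reboot "$(uname -r)" "$newest" && echo "reboot pending"`.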

Normally we do rolling upgrades of our cluster, node by node:
- distribute all running VMs across the other nodes
- run dist-upgrade
- reboot
- check that everything is working
- migrate one or more less important VMs back, check again, and potentially wait a few hours to see whether problems arise
- continue with the next node
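The steps above can be sketched as a small helper that only prints the commands for one node, so you can read them before running anything. This is an illustration under assumptions: the function name is made up, the VM IDs in the usage example are hypothetical, and for simplicity it sends every VM to a single target node instead of spreading them across all other nodes.

```shell
#!/bin/sh
# print_upgrade_plan: print (do not run) the commands for upgrading one node.
#   $1 = node being upgraded
#   $2 = node that temporarily receives the VMs
#   remaining arguments = IDs of the VMs currently running on $1
print_upgrade_plan() {
    node="$1"; target="$2"; shift 2
    echo "# on $node: move running VMs away"
    for vmid in "$@"; do
        echo "qm migrate $vmid $target --online"
    done
    echo "# on $node: upgrade and reboot"
    echo "apt-get update"
    echo "apt-get dist-upgrade"
    echo "reboot"
    echo "# after $node is back: check before migrating VMs back"
    echo "pveversion"
    echo "pvecm status"
    echo "ceph -s"
}
```

For example, `print_upgrade_plan occ-host-0001 occ-host-0002 100 101` prints the plan for the first node; the later "migrate a less important VM back and wait" step is left to you, since how long to wait is a judgment call.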
 
I have done what you explained. Everything works fine (except Ceph doing clean+snaptrim, but I'm not worried about that).
After the reboot, the networking service was not available; I had to run service networking restart on each node.
To fix that I tried "systemctl enable networking", and I will verify whether it works at the next upgrade and reboot :)

Thank you for the answers.