Hello,
I seem to be having a problem with one of the nodes in my 3-node Proxmox Ceph cluster. Every few days one of the PVE services is reported by the kernel as a hung task (hung_task) and the node gets stuck at a constant ~12% iowait. It does not recover from this; I have to reboot the node to get it back to normal operation. The Ceph cluster status and the VMs all seem to be fine while this happens, but the load is noticeable on the node itself (slow SSH logins, etc.). As far as I can tell nothing unusual is going on, and it happens quite out of the blue.
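For anyone who wants to see what I'm looking at, checks along these lines show the hung-task reports, the D-state processes and the constant iowait on the affected node (iostat comes from the sysstat package):
Code:
# kernel hung-task reports
dmesg -T | grep -i "blocked for more than"
# processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# per-device utilisation and wait times, refreshed every 5 seconds
iostat -x 5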
I only have KVM VMs running on this node, with HA enabled for some of them. They use the default Ceph storage created automatically by the UI (RBD). The nodes are each based on a single Xeon E3-1231 v3 with 16 GB RAM on a Supermicro mainboard (an upgrade to 32 GB per node is pending). I'm running Proxmox VE Community 5.1.
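The Ceph/RBD side looks clean whenever I check it with something like:
Code:
# overall cluster health and per-OSD usage
ceph -s
ceph osd df tree
# Proxmox view of the configured storages
pvesm status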
I do have NFS storage entries, but they are disabled and were already disabled at the time the iowait went up and stayed there for no clear reason. The OS/WAL/DB SSD (a Samsung SM863) is not worn out (0% wear), shows no errors in smartctl and seems fine overall.
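Those conclusions come from checks along these lines (the device path and storage IDs are specific to my setup):
Code:
# SMART health, error log and wear level of the SM863
smartctl -a /dev/sda
# confirm the NFS entries still carry the 'disable' flag
grep -A4 "^nfs:" /etc/pve/storage.cfg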
Has anyone else experienced this kind of behaviour, and how would I go about fixing it?
Attached is an excerpt of the syslog (plus graphs showing that the iowait starts at the moment the kernel reports the pvesr task as hung; the previous time it was pve-ha-lrm, so the affected service seems to vary).
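The excerpt was pulled with something along the lines of:
Code:
# grab the hung-task reports and the surrounding lines
grep -iE -B2 -A20 "blocked for more than|hung_task" /var/log/syslog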
Output of pveversion -v:
Code:
proxmox-ve: 5.1-35 (running kernel: 4.13.13-4-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-15
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
Thanks in advance.