0-100% instant random core utilization

hanzfritz85

Dear all,
Please help. I have two identical servers: Dell R630, 2x Intel Xeon E5-2650 v4 (48 threads), 128 GB RAM.
Both run Proxmox 5.4-4 and both show the same problem.
Each machine runs 2 containers with 16 threads each.
htop inside the CTs shows some of the cores at 100% utilization, while the host shows far less than that.
I can't monitor CPU utilization reliably, and it may be worse than just a wrong CPU stat: I may not be able to run under heavy load.
Some GIFs are attached.
Any ideas what could be wrong?
 

Attachments: grabilla.Uh6732.gif, grabilla.Uh7144.gif, grabilla.Uh8648.gif
Has anyone else faced a similar problem?
It looks like incorrect CPU process queues, probably because of the dual-CPU machine configuration.
5.4-6 on a single-CPU machine shows none of that.
 
Can you please check with the following command how the cores of the containers are distributed?
Code:
pct cpusets
Background: CTs have their core(s) distributed across the available CPU cores. With many CTs or VMs, it is likely that some other CT or VM is also using the same core. The usage displayed for a core inside the CT is the actual usage of that core on the host, so any other service on the host that ramps up usage on that particular core will influence the displayed usage.
 
-------------------------------------------------------------------------------------------------------------------------------------
100: 2 5 6 7 10 12 14 15 20 22 28 34 38 41 44 45
101: 0 1 3 4 8 9 11 13 16 17 18 21 25 27 31 32
-------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------
100: 1 2 3 4 5 10 11 12 13 14 15 16 17 18 19 21
101: 0 6 7 8 9 20 22 23 25 27 28 29 34 36 41 46
----------------------------------------------------------------------------------------------------------------------------------------
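For anyone reading along, here is a minimal sketch for checking whether two containers were handed the same host cores, based on the `pct cpusets` output above. The function names `parse_cpusets` and `shared_cpus` are hypothetical helpers written for this post, not part of any Proxmox tool, and the input is simply the first listing pasted as text.

```python
def parse_cpusets(text):
    """Parse lines of the form "<vmid>: <cpu> <cpu> ..." into {vmid: set of CPU ids}."""
    cpusets = {}
    for line in text.splitlines():
        vmid, sep, cpus = line.partition(":")
        if not sep or not vmid.strip().isdigit():
            continue  # skip separator lines and blanks
        cpusets[int(vmid)] = {int(c) for c in cpus.split()}
    return cpusets


def shared_cpus(cpusets):
    """Return {(vmid_a, vmid_b): sorted shared CPU ids} for every overlapping pair."""
    overlaps = {}
    ids = sorted(cpusets)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            common = cpusets[a] & cpusets[b]
            if common:
                overlaps[(a, b)] = sorted(common)
    return overlaps


# First listing from the post above, pasted verbatim.
first_server = """\
100: 2 5 6 7 10 12 14 15 20 22 28 34 38 41 44 45
101: 0 1 3 4 8 9 11 13 16 17 18 21 25 27 31 32
"""

sets = parse_cpusets(first_server)
print(shared_cpus(sets))  # {} means the two CTs share no host core
```

For both listings above the result is empty, so CT 100 and CT 101 do not share cores with each other; that still says nothing about host services or kernel threads running on those cores.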
 
Thank you for your attention. htop on the host shows no process using 100% of any core. Only some of the cores jump between 0 and 100%; the others work as expected. The GIFs attached to the first message show it more clearly.
 
What is the config of those VMs in question?
 
I meant the config shown by 'pct config <vmid>'.
 
pct config 100
arch: amd64
cores: 16
hostname:X
memory: 8192
net0: name=eth0,bridge=vmbr0,firewall=1,gw=,hwaddr=,ip=,type=veth
net1: name=eth1,bridge=vmbr1,firewall=1,hwaddr=,ip=,type=veth
onboot: 1
ostype: debian
rootfs: thinpool:vm-100-disk-0,size=16G
swap: 1024

sudo pct config 101
arch: amd64
cores: 16
hostname:X
memory: 65536
net0: name=eth0,bridge=vmbr0,firewall=1,gw=,hwaddr=,ip=,type=veth
net1: name=eth1,bridge=vmbr1,firewall=1,hwaddr=,ip=,type=veth
onboot: 1
ostype: debian
parent: final
rootfs: thinpool:vm-101-disk-0,size=240G
swap: 1024
 
What else is running on the server? You could use atop to record the node and then compare the results with the container.

Is there anything related in the log files (/var/log/)?
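If atop is not at hand, a rough way to make the same host-versus-container comparison is to take two snapshots of `/proc/stat` and compute per-core busy percentages, once on the node and once inside the CT. The sketch below assumes the standard Linux `/proc/stat` field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are hypothetical. As far as I know, `/proc/stat` inside the CT is served by lxcfs, so the core numbering there may not match the host's.

```python
import time


def read_cpu_times(stat_text):
    """Map "cpuN" -> (busy_jiffies, total_jiffies) from /proc/stat content."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep only per-core lines ("cpu0 ...", "cpu1 ..."), not the aggregate "cpu" line.
        if len(fields) < 5 or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:9]]  # user..steal; guest time is already counted in user
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def utilization(before, after):
    """Per-core busy percentage between two snapshots from read_cpu_times()."""
    usage = {}
    for cpu, (busy1, total1) in after.items():
        if cpu not in before:
            continue
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        usage[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage


def sample(interval=1.0, path="/proc/stat"):
    """Read the stat file twice, `interval` seconds apart, and return per-core usage."""
    with open(path) as f:
        first = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_cpu_times(f.read())
    return utilization(first, second)


# Usage on a Linux host or inside the CT:
#   for cpu, pct in sample().items():
#       print(f"{cpu}: {pct:5.1f}%")
```

Running it on both sides at roughly the same time gives numbers to compare directly, instead of eyeballing two htop windows.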
 
Will do. Meanwhile I've found another server, with 2x Intel(R) Xeon(R) CPU E5-2630 v3 and Proxmox 5.3-9, and there is no such problem there.
 
Could you please post a 'pveversion -v' of the 5.3-9 and 5.4-4 nodes?
 

Nothing in the logs. atop shows mysql at 68%.
 

Attachments: atop.jpg
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
pve-manager: 5.4-4 (running version: 5.4-4/97a96833)
pve-kernel-4.15: 5.4-1
pve-kernel-4.15.18-13-pve: 4.15.18-37
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3


pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-45
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-37
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
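Since the two `pveversion -v` listings are long, a small sketch for pulling out only the packages that differ between the nodes may help. `parse_versions` and `version_diff` are hypothetical helper names; the sample strings are a three-line excerpt of the listings above, and the full outputs can be pasted in the same way.

```python
def parse_versions(text):
    """Parse "package: version" lines into a dict; lines without a colon are ignored."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(":")
        if sep and name.strip():
            versions[name.strip()] = version.strip()
    return versions


def version_diff(a, b):
    """Return {package: (version_a, version_b)} where the two nodes differ.

    A package missing on one node is reported with None on that side.
    """
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(set(a) | set(b))
        if a.get(pkg) != b.get(pkg)
    }


# Excerpt from the 5.4-4 node.
node_54 = """\
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
pve-firmware: 2.0-6
"""

# Excerpt from the 5.3-9 node.
node_53 = """\
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
pve-firmware: 2.0-6
"""

diff = version_diff(parse_versions(node_54), parse_versions(node_53))
for pkg, (v54, v53) in diff.items():
    print(f"{pkg}: {v54} (5.4-4 node) vs {v53} (5.3-9 node)")
```

On this excerpt only lxc-pve and lxcfs are reported. A version difference alone does not prove which package is responsible, but it narrows down what to look at.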
 
1600% utilization across all the cores in the CT, exactly every 20 seconds.
I only see 16 processes, 15x mysql and 1x htop, which together display the 1600% (16 x 100). From the screenshot I don't see anything wrong with the node, just that the container has a good load of work.
 
Really, guys, please help. Cores go from 0 to 100% momentarily, and some cores sit at 100% constantly.
Do you need a GIF?
 

Attachments: host_ct.jpg