No VM is running, but 8 of 16 CPUs show 100% usage in htop

Saco

Aug 22, 2024
As shown in the attachment and described in the title, it's really frustrating to use PVE in this situation, and I don't know what's causing the problem. The system is PVE 7.4-18, and no VM is running.

Edit:
I found that one specific VM can cause the issue, as shown in my attachments, but I can't tell why.
When I started the VMs one by one, only Virtual Machine 103 (OpenWRT-Gateway) caused the issue, and sometimes the high usage didn't go away even after I shut that VM down.
 

Attachments

  • Snipaste_2024-08-23_09-29-03.jpg
  • Snipaste_2024-08-23_09-32-11.jpg
  • Snipaste_2024-08-23_09-34-08.jpg
The attachment is missing; please upload it again.
Can you tell us which processes are using these 8 cores?
 
Have you cross-checked with top or the PVE GUI? htop can display incorrect values.
 
Sorry, I just noticed the original attachment was too large. Here are new screenshots from the PVE GUI, htop, and top.
I can't upload a full-screen screenshot because the image would be too big.
 

Attachments

  • Snipaste_2024-08-23_09-29-03.jpg
  • Snipaste_2024-08-23_09-32-11.jpg
  • Snipaste_2024-08-23_09-34-08.jpg
Yes, I just cross-checked. The PVE GUI shows the same CPU usage as htop, but in top I can't see a total CPU usage figure or any process with high CPU usage.
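
One way to cross-check the numbers independently of how htop or top render them is to read the kernel's own counters from /proc/stat. The sketch below compares two samples of the aggregate "cpu" line and prints the share of each category, which also separates interrupt time (irq, softirq) from user and system time. The function name cpu_breakdown and the demo values are made up for illustration and are not taken from this thread.

```shell
#!/bin/sh
# cpu_breakdown BEFORE AFTER
# BEFORE and AFTER are the aggregate "cpu ..." lines of two /proc/stat samples.
# Prints each category's share of the elapsed CPU ticks, in percent.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { delta[i] = $i - before[i]; total += delta[i] }
            if (total == 0) { print "no ticks elapsed"; exit }
            split("user nice system idle iowait irq softirq steal", name, " ")
            for (i = 2; i <= 9; i++)
                printf "%s=%.1f%% ", name[i - 1], 100 * delta[i] / total
            print ""
        }'
}

# Demo with made-up sample lines (hypothetical values):
cpu_breakdown "cpu  100 0 100 1000 0 0 50 0 0 0" \
              "cpu  110 0 120 1500 0 0 520 0 0 0"

# On a live host, take two real samples one second apart:
#   a=$(head -n1 /proc/stat); sleep 1; b=$(head -n1 /proc/stat)
#   cpu_breakdown "$a" "$b"
```

If most of the non-idle time lands in irq or softirq rather than user or system, that would be consistent with load that no single process in top accounts for.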
 
I see 25% usage by the OpenWRT VM and dependent processes. There may still be a small overhead if the VM does not use the CPU type host but a masked CPU, but that does not explain the other 25% usage.
What does it look like when the OpenWRT VM is off?
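
To see which CPU type a VM is configured with, the "cpu:" line of its config can be inspected. The following is a minimal sketch that parses the text printed by `qm config <vmid>`; the helper name vm_cpu_type and the demo config excerpt are hypothetical.

```shell
#!/bin/sh
# vm_cpu_type: read a VM config (as printed by `qm config <vmid>`) on stdin
# and print the configured CPU type, or a note if no "cpu:" line is present.
vm_cpu_type() {
    awk -F': *' '
        $1 == "cpu" {
            split($2, opt, ",")              # first option is the CPU type
            sub(/^cputype=/, "", opt[1])     # accept "cputype=host" as well as "host"
            print opt[1]
            found = 1
        }
        END { if (!found) print "not set (PVE default CPU type applies)" }'
}

# Demo with a hypothetical config excerpt:
printf 'cores: 4\ncpu: host,flags=+aes\nmemory: 2048\n' | vm_cpu_type

# On the PVE host, for the VM from this thread:
#   qm config 103 | vm_cpu_type
```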
 
I've just removed the OpenWRT VM from PVE, and I found something interesting:
1. I think it's a network-related issue.
2. Whichever NIC I use, as soon as I plug in a network cable that connects PVE to the internet (*), CPU usage goes to about 50%. When I unplug it, usage drops to close to 0%. No VM was running during the whole process.

(*) I've tried connecting directly to the router and going through a switch; both gave the same result.
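
Since the load follows the cable, it may be worth watching the kernel's network receive softirq counters while plugging and unplugging. The sketch below sums one row of /proc/softirqs across all CPUs; the helper name softirq_total and the demo numbers are hypothetical.

```shell
#!/bin/sh
# softirq_total NAME [FILE]
# Sum the per-CPU counters of one /proc/softirqs row (e.g. NET_RX).
# FILE defaults to /proc/softirqs; "-" reads from stdin.
softirq_total() {
    awk -v row="$1:" '
        $1 == row {
            sum = 0
            for (i = 2; i <= NF; i++) sum += $i
            printf "%.0f\n", sum
        }' "${2:-/proc/softirqs}"
}

# Demo with made-up counters for a two-CPU machine:
printf '        CPU0   CPU1\n NET_TX:    5      7\n NET_RX: 1000   2500\n' |
    softirq_total NET_RX -

# On the host, compare the counter before and after plugging in the cable:
#   before=$(softirq_total NET_RX); sleep 5; after=$(softirq_total NET_RX)
#   echo "NET_RX softirqs per second: $(( (after - before) / 5 ))"
```

A rate that jumps sharply when the link comes up would point at incoming traffic rather than at a local process.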
 
Strange. Is there perhaps an IP or MAC conflict in the network?
Does this also happen when you connect another machine to the network?
 
No, only PVE has this problem, and I haven't found any IP or MAC conflict.
 
If it is OK with you, can you attach your journal here? Perhaps a clue can be found there.

Part of the journal. Please adjust date and time.
Code:
journalctl --since "2023-12-06 20:00" --until "2023-12-07 03:00" > journal_syslog.txt

Or the whole journal, probably too big for here, but easier to analyse.
Code:
tar -czf journal.tar.gz /var/log/journal
 
The following is the output of journalctl from just before I plugged in the cable until I saw the CPU usage go up to around 50%:
Code:
Aug 23 21:14:12 pve kernel: alx 0000:05:00.0 enp5s0: NIC Up: 1 Gbps Full
Aug 23 21:14:12 pve kernel: vmbr0: port 1(enp5s0) entered blocking state
Aug 23 21:14:12 pve kernel: vmbr0: port 1(enp5s0) entered forwarding state
Aug 23 21:14:19 pve systemd[1]: session-34.scope: Succeeded.
Aug 23 21:14:19 pve systemd-logind[1076]: Session 34 logged out. Waiting for processes to exit.
Aug 23 21:14:19 pve systemd-logind[1076]: Removed session 34.
Aug 23 21:14:19 pve pvedaemon[38434]: <root@pam> end task UPID:pve:0000B3E5:001C4BB9:66C88AD9:vncshell::root@pam: OK
Aug 23 21:14:19 pve pvedaemon[46251]: starting termproxy UPID:pve:0000B4AB:001C6B9F:66C88B2B:vncshell::root@pam:
Aug 23 21:14:19 pve pvedaemon[38434]: <root@pam> starting task UPID:pve:0000B4AB:001C6B9F:66C88B2B:vncshell::root@pam:
Aug 23 21:14:19 pve pvedaemon[42744]: <root@pam> successful auth for user 'root@pam'
Aug 23 21:14:19 pve login[46254]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Aug 23 21:14:19 pve systemd-logind[1076]: New session 35 of user root.
Aug 23 21:14:19 pve systemd[1]: Started Session 35 of user root.
Aug 23 21:14:19 pve login[46259]: ROOT LOGIN  on '/dev/pts/1'
And this is the output from when I unplugged the cable until I confirmed the usage had dropped to around 0%:
Code:
Aug 23 21:22:07 pve kernel: alx 0000:05:00.0 enp5s0: Link Down
Aug 23 21:22:07 pve kernel: vmbr0: port 1(enp5s0) entered disabled state

I don't see anything suspicious. Could it be a firewall/Linux bridge issue?
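
To pick the link and bridge-port state changes out of a longer journal, a small filter can help. This is only a sketch: the helper name link_events is hypothetical, and the "NIC Up" / "Link Down" wording it matches is what the alx driver prints in the excerpts above, so other drivers may need a different pattern.

```shell
#!/bin/sh
# link_events IFACE
# Read journal text on stdin and print only kernel lines that report a link
# or bridge-port state change for IFACE.
link_events() {
    grep -E "kernel: .*$1.*(NIC Up|Link Down|entered [a-z]+ state)"
}

# Demo with lines taken from the journal excerpts in this thread:
link_events enp5s0 <<'EOF'
Aug 23 21:14:12 pve kernel: alx 0000:05:00.0 enp5s0: NIC Up: 1 Gbps Full
Aug 23 21:14:12 pve kernel: vmbr0: port 1(enp5s0) entered forwarding state
Aug 23 21:14:19 pve systemd[1]: session-34.scope: Succeeded.
Aug 23 21:22:07 pve kernel: alx 0000:05:00.0 enp5s0: Link Down
EOF

# On the host, restricted to kernel messages:
#   journalctl -k --since "10 minutes ago" | link_events enp5s0
```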
 
Maybe a malicious actor? Some program that runs in the background once it can connect to the internet? It wouldn't be the first time someone here reported a compromised PVE server where the output of top is obfuscated. Is there any actual traffic on the link when you connect it?
 
Yes, that would also be a possibility.



No, I don't see anything interesting in the journal either. Have you configured anything special, such as the PVE firewall, interfaces, or routing?

Could you post your Proxmox network config? If you have also configured the PVE firewall, please post those configs as well.

You wrote "PVE 7.4-18, and no VM is running." I would like to ask about this again: does the problem now occur with a VM running, without one, or both?

Is OpenWRT now also your default gateway for Proxmox, or does Proxmox have its own gateway?

And is there a reason why you are still using PVE 7.x? It is EOL, which means there are no more updates.
 
iftop shows a lot of traffic, but I can't tell whether it is legitimate.
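
To put a number on "a lot of traffic", the received-packet counter in /proc/net/dev can be sampled twice. The sketch below extracts that counter for one interface; the helper name rx_packets and the demo figures are hypothetical.

```shell
#!/bin/sh
# rx_packets IFACE [FILE]
# Print the received-packet counter for IFACE from /proc/net/dev.
# FILE defaults to /proc/net/dev; "-" reads from stdin.
rx_packets() {
    awk -v ifc="$1" '
        { line = $0; sub(/^ +/, "", line); n = index(line, ":") }
        n > 0 && substr(line, 1, n - 1) == ifc {
            split(substr(line, n + 1), f, " ")   # f[1] = rx bytes, f[2] = rx packets
            print f[2]
        }' "${2:-/proc/net/dev}"
}

# Demo with made-up counters in /proc/net/dev layout:
printf '%s\n' \
    'Inter-|   Receive                                   |  Transmit' \
    ' face |bytes packets errs drop fifo frame compressed multicast|bytes packets' \
    '    lo:    1000      10    0    0    0     0          0         0     1000      10' \
    'enp5s0: 9000000   12345    0    0    0     0          0       100   500000    4000' |
    rx_packets enp5s0 -

# On the host, estimate the incoming packet rate on the bridge:
#   a=$(rx_packets vmbr0); sleep 5; b=$(rx_packets vmbr0)
#   echo "rx packets/s on vmbr0: $(( (b - a) / 5 ))"
```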
 

Attachments

  • Snipaste_2024-08-23_21-41-33.jpg
I didn't modify the firewall config. The following is the whole /etc/network/interfaces config (comments at the top removed):
Code:
auto lo
iface lo inet loopback


iface enp6s0 inet manual
#management_port


iface enp5s0 inet manual


iface wlp7s0 inet manual


auto vmbr0
iface vmbr0 inet static
        address 192.168.93.23/24
        gateway 192.168.93.1
        bridge-ports enp5s0 enp6s0
        bridge-stp off
        bridge-fd 0

As for when the issue occurs: it happens whether or not any VM is running. What matters is whether the PVE node is connected to the gateway (directly or via a switch). The gateway has always been OpenWRT, previously as a VM on this node and now running natively on another machine.
When OpenWRT was a VM on the node, the issue occurred when OpenWRT's WAN was connected to the internet.
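
One detail that can be read off the config above is that vmbr0 bridges two physical ports with STP off. With a single cable that is harmless, but if both ports ever reach the same switch or router, frames can loop between them. Whether this plays any role here is not established in the thread; the sketch below only flags such bridges in an interfaces file. The helper name bridge_loop_risk is hypothetical, and it assumes the hyphenated option names used above.

```shell
#!/bin/sh
# bridge_loop_risk
# Read /etc/network/interfaces-style text on stdin and list bridges that have
# more than one port while STP is not switched on.
bridge_loop_risk() {
    awk '
        function report() {
            if (name != "" && nports > 1 && stp != "on")
                printf "%s: %d ports, STP off\n", name, nports
        }
        $1 == "iface"        { report(); name = $2; nports = 0; stp = "off" }
        $1 == "bridge-ports" { nports = NF - 1 }
        $1 == "bridge-stp"   { stp = $2 }
        END                  { report() }'
}

# Demo with the bridge stanza posted in this thread:
bridge_loop_risk <<'EOF'
auto vmbr0
iface vmbr0 inet static
        address 192.168.93.23/24
        gateway 192.168.93.1
        bridge-ports enp5s0 enp6s0
        bridge-stp off
        bridge-fd 0
EOF

# On the host:
#   bridge_loop_risk < /etc/network/interfaces
```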

As for the version, I don't plan to upgrade right now, but if this turns out to be a version issue, I will consider it.
 
What NICs are in the system, and what drivers are you using?

lspci | grep -i 'eth\|net'

Identifying drivers is a bit trickier: use lsmod to find the module, then modinfo to get the driver details.
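
As a possible shortcut, the driver bound to an interface can also be read from sysfs by resolving the "driver" symlink of its device (`ethtool -i <iface>` reports the same thing where ethtool is installed). A minimal sketch follows; the helper name nic_driver and its optional second argument are hypothetical.

```shell
#!/bin/sh
# nic_driver IFACE [SYSFS_ROOT]
# Print the kernel driver bound to IFACE by resolving its sysfs "driver"
# symlink. SYSFS_ROOT defaults to /sys. Virtual interfaces such as bridges
# have no such link and make the function return 1.
nic_driver() {
    link="${2:-/sys}/class/net/$1/device/driver"
    [ -e "$link" ] || { echo "no driver link for $1" >&2; return 1; }
    basename "$(readlink -f "$link")"
}

# List every interface on this machine that has a driver link:
for path in /sys/class/net/*; do
    ifc=${path##*/}
    if drv=$(nic_driver "$ifc" 2>/dev/null); then
        echo "$ifc: $drv"
    fi
done

# Then look up the module details as suggested above, for example:
#   modinfo "$(nic_driver enp5s0)"
```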
The NIC in use:
Code:
root@pve:~# lspci | grep -i 'eth\|net'
05:00.0 Ethernet controller: Qualcomm Atheros Killer E2500 Gigabit Ethernet Controller (rev 10)
06:00.0 Ethernet controller: Qualcomm Atheros Killer E2500 Gigabit Ethernet Controller (rev 10)
07:00.0 Network controller: Intel Corporation Wireless-AC 9260 (rev 29)

And the info from lspci -vv: Kernel driver in use: alx
Code:
05:00.0 Ethernet controller: Qualcomm Atheros Killer E2500 Gigabit Ethernet Controller (rev 10)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Killer E2500 Gigabit Ethernet Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 17
        IOMMU group: 18
        Region 0: Memory at 91400000 (64-bit, non-prefetchable) [size=256K]
        Region 2: I/O ports at 4000 [disabled] [size=128]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn+ AttnInd+ PwrInd+ RBE+ FLReset- SlotPowerLimit 10.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [180 v1] Device Serial Number ff-dc-48-03-00-d8-61-ff
        Kernel driver in use: alx
        Kernel modules: alx

Module details:
Code:
root@pve:~# modinfo alx
filename:       /lib/modules/6.2.16-20-bpo11-pve/kernel/drivers/net/ethernet/atheros/alx/alx.ko
license:        GPL
description:    Qualcomm Atheros(R) AR816x/AR817x PCI-E Ethernet Network Driver
author:         Qualcomm Corporation
author:         Johannes Berg <johannes@sipsolutions.net>
srcversion:     693BCA3A05C3F4A43DE0D3F
alias:          pci:v00001969d000010A0sv*sd*bc*sc*i*
alias:          pci:v00001969d000010A1sv*sd*bc*sc*i*
alias:          pci:v00001969d00001090sv*sd*bc*sc*i*
alias:          pci:v00001969d0000E0B1sv*sd*bc*sc*i*
alias:          pci:v00001969d0000E0A1sv*sd*bc*sc*i*
alias:          pci:v00001969d0000E091sv*sd*bc*sc*i*
alias:          pci:v00001969d00001091sv*sd*bc*sc*i*
depends:        mdio
retpoline:      Y
intree:         Y
name:           alx
vermagic:       6.2.16-20-bpo11-pve SMP preempt mod_unload modversions
 