Server 100% CPU and cannot access storage

mikiudon

Member
Apr 13, 2022
17
1
8
Hi guys, server went 100% CPU, all VMs unable to access - though it still was active. Seems the network gave up. Pls see log below.

Code:
Jun  7 22:51:49 imm4 pvestatd[1691]: status update time (12.329 seconds)
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable
Jun  7 22:51:53 imm4 systemd[1]: session-430.scope: Succeeded.
Jun  7 22:51:53 imm4 systemd[1]: session-430.scope: Consumed 1min 34.922s CPU time.
Jun  7 22:51:54 imm4 systemd[1]: Started Session 431 of user root.
Jun  7 22:51:54 imm4 systemd[1]: session-431.scope: Succeeded.
Jun  7 22:51:55 imm4 pvestatd[1691]: storage 'vmware2' is not online
Jun  7 22:51:57 imm4 systemd[1]: Started Session 432 of user root.
Jun  7 22:51:58 imm4 pvestatd[1691]: storage 'nfstb4' is not online
Jun  7 22:51:58 imm4 systemd[1]: session-432.scope: Succeeded.
Jun  7 22:51:58 imm4 systemd[1]: session-432.scope: Consumed 1.632s CPU time.
Jun  7 22:51:58 imm4 pmxcfs[1556]: [status] notice: received log
Jun  7 22:52:00 imm4 kernel: [756723.914202] nfs: server 10.254.10.211 not responding, timed out
Jun  7 22:52:01 imm4 pvestatd[1691]: storage 'smb' is not online
Jun  7 22:52:01 imm4 pvestatd[1691]: status update time (12.013 seconds)
Jun  7 22:52:04 imm4 pvestatd[1691]: storage 'nfstb4' is not online
Jun  7 22:52:08 imm4 systemd[1]: Stopping User Manager for UID 0...
Jun  7 22:52:08 imm4 systemd[3114009]: Stopped target Main User Target.
Jun  7 22:52:08 imm4 systemd[3114009]: Stopped target Basic System.
Jun  7 22:52:08 imm4 systemd[3114009]: Stopped target Paths.
Jun  7 22:52:08 imm4 systemd[3114009]: Stopped target Sockets.
Jun  7 22:52:08 imm4 systemd[3114009]: Stopped target Timers.
Jun  7 22:52:08 imm4 systemd[3114009]: dirmngr.socket: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Closed GnuPG network certificate management daemon.
Jun  7 22:52:08 imm4 systemd[3114009]: gpg-agent-browser.socket: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jun  7 22:52:08 imm4 systemd[3114009]: gpg-agent-extra.socket: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jun  7 22:52:08 imm4 systemd[3114009]: gpg-agent-ssh.socket: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jun  7 22:52:08 imm4 systemd[3114009]: gpg-agent.socket: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Closed GnuPG cryptographic agent and passphrase cache.
Jun  7 22:52:08 imm4 systemd[3114009]: Removed slice User Application Slice.
Jun  7 22:52:08 imm4 systemd[3114009]: Reached target Shutdown.
Jun  7 22:52:08 imm4 systemd[3114009]: systemd-exit.service: Succeeded.
Jun  7 22:52:08 imm4 systemd[3114009]: Finished Exit the Session.
Jun  7 22:52:08 imm4 systemd[3114009]: Reached target Exit the Session.
Jun  7 22:52:08 imm4 systemd[1]: user@0.service: Succeeded.
Jun  7 22:52:08 imm4 systemd[1]: Stopped User Manager for UID 0.
Jun  7 22:52:08 imm4 systemd[1]: Stopping User Runtime Directory /run/user/0...
Jun  7 22:52:08 imm4 systemd[1]: run-user-0.mount: Succeeded.
Jun  7 22:52:08 imm4 systemd[1]: user-runtime-dir@0.service: Succeeded.
Jun  7 22:52:08 imm4 systemd[1]: Stopped User Runtime Directory /run/user/0.
Jun  7 22:52:08 imm4 systemd[1]: Removed slice User Slice of UID 0.
Jun  7 22:52:08 imm4 systemd[1]: user-0.slice: Consumed 1min 39.806s CPU time.

After this log, all the network went down and storage complained not responsive

Code:
Jun  7 22:52:10 imm4 kernel: [756733.642238] nfs: server 10.254.10.211 not responding, timed out
Jun  7 22:52:10 imm4 pvestatd[1691]: storage 'vmware2' is not online
Jun  7 22:52:13 imm4 pvestatd[1691]: storage 'smb' is not online
Jun  7 22:52:14 imm4 pvestatd[1691]: status update time (12.529 seconds)
Jun  7 22:52:18 imm4 kernel: [756741.582281] nfs: server 10.254.10.211 not responding, timed out
Jun  7 22:52:19 imm4 pvestatd[1691]: storage 'vmware2' is not online
Jun  7 22:52:23 imm4 pvestatd[1691]: storage 'nfstb4' is not online

Then this

Code:
Jun  7 23:05:44 imm4 pvestatd[1691]: VM 163 qmp command failed - VM 163 qmp command 'query-proxmox-support' failed - unable to connect to VM 163 qmp socket - timeout after 31 retries
Jun  7 23:05:47 imm4 pvestatd[1691]: VM 171 qmp command failed - VM 171 qmp command 'query-proxmox-support' failed - got timeout
Jun  7 23:05:50 imm4 pvestatd[1691]: storage 'smb' is not online
Jun  7 23:05:53 imm4 pvestatd[1691]: got timeout
Jun  7 23:05:53 imm4 pvestatd[1691]: unable to activate storage 't16v2' - directory '/mnt/pve/t16v2' does not exist or is unreachable
Jun  7 23:05:54 imm4 kernel: [757556.941891] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:05:54 imm4 pvestatd[1691]: storage 'nfstb4' is not online
Jun  7 23:06:00 imm4 pvestatd[1691]: storage 'vmware2' is not online
Jun  7 23:06:00 imm4 pvestatd[1691]: status update time (21.502 seconds)
Jun  7 23:06:03 imm4 kernel: [757566.413892] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:08 imm4 kernel: [757571.533927] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:12 imm4 pvestatd[1691]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - got timeout
Jun  7 23:06:13 imm4 kernel: [757576.909969] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:15 imm4 pvestatd[1691]: VM 171 qmp command failed - VM 171 qmp command 'query-proxmox-support' failed - unable to connect to VM 171 qmp socket - timeout after 31 retries
Jun  7 23:06:18 imm4 pvestatd[1691]: VM 163 qmp command failed - VM 163 qmp command 'query-proxmox-support' failed - unable to connect to VM 163 qmp socket - timeout after 31 retries
Jun  7 23:06:19 imm4 kernel: [757582.030002] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:21 imm4 pvestatd[1691]: storage 'smb' is not online
Jun  7 23:06:23 imm4 pvestatd[1691]: got timeout
Jun  7 23:06:23 imm4 pvestatd[1691]: unable to activate storage 't16v2' - directory '/mnt/pve/t16v2' does not exist or is unreachable
Jun  7 23:06:24 imm4 kernel: [757587.153989] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:24 imm4 pvestatd[1691]: storage 'nfstb4' is not online
Jun  7 23:06:30 imm4 pvestatd[1691]: storage 'vmware2' is not online
Jun  7 23:06:30 imm4 pvestatd[1691]: status update time (30.726 seconds)
Jun  7 23:06:34 imm4 kernel: [757597.390033] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:35 imm4 pmxcfs[1556]: [status] notice: received log
Jun  7 23:06:39 imm4 kernel: [757602.510081] nfs: server 10.254.10.211 not responding, timed out
Jun  7 23:06:43 imm4 pvestatd[1691]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries

Tried to 'reboot' and it took a while because all of these weren't responding. When it came back up, it booted all the VMs and it looks fine now.

I have 2 more hosts which are on the same settings but all good during this time.

Any idea what's going which caused it to down the network?

pve-manager/7.2-3/c743d6c1
Server has 512GB RAM, 56 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (2 Sockets)

Ignore the messages nfstb4, vmware2,smb not online ... those are old storage which I am meaning to clean up. The real one is t16v2. NFS backend to Truenas.

The ones I would like to understand is

Jun 7 22:51:53 imm4 QEMU[3114093]: kvm: warning: TSC frequency mismatch between VM (2499999 kHz) and host (2596991 kHz), and TSC scaling unavailable


How come this one will then trigger a network collapse and stop all user services etc?
 
Last edited: