Ubuntu VM Offline Cores

ITxD

Hi,

We have just installed Proxmox on an HPE ProLiant server with 2 CPUs (128*2=256|512 vCores). We've created an Ubuntu 24.04 VM and allocated 254 cores (508 vCores) to it.

We've noticed that the VM can see all the cores, but only half of them are online; see the output below for more information:

Is there anything we're missing here? How can the rest of the cores be enabled?

Thanks


Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 508
On-line CPU(s) list: 0-253
Off-line CPU(s) list: 254-507
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9754 128-Core Processor
CPU family: 25
Model: 160
Thread(s) per core: 1
Core(s) per socket: 254
Socket(s): 1
Stepping: 2
BogoMIPS: 4493.24
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm flush_l1d arch_capabilities
Virtualization features:
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 15.9 MiB (254 instances)
L1i: 15.9 MiB (254 instances)
L2: 127 MiB (254 instances)
L3: 4 GiB (254 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-253
NUMA node1 CPU(s):
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
(base) vmadmin@pdva-prod-vm3:~$
 
I also noticed that it only shows 1 socket. I tried lowering the core count to 128 (256) with 2 sockets, and now all the cores are online and lscpu shows 2 sockets instead of 1.

---------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9754 128-Core Processor
CPU family: 25
Model: 160
Thread(s) per core: 1
Core(s) per socket: 128
Socket(s): 2
Stepping: 2
BogoMIPS: 4493.24
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm flush_l1d arch_capabilities
Virtualization features:
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 16 MiB (256 instances)
L1i: 16 MiB (256 instances)
L2: 128 MiB (256 instances)
L3: 4 GiB (256 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127
NUMA node1 CPU(s): 128-255
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
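
For anyone who wants to compare the online and offline CPU counts without reading through lscpu, here is a rough Python sketch (not from the original posts). It assumes the guest exposes the usual Linux sysfs files /sys/devices/system/cpu/online and /sys/devices/system/cpu/offline; the helper names parse_cpu_list and read_cpu_list are made up for illustration.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel-style CPU list such as "0-253" or "0-3,8,10-11"."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty list (e.g. no offline CPUs) yields nothing
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_cpu_list(name: str) -> list[int]:
    """Read /sys/devices/system/cpu/<name> ("online" or "offline") if it exists."""
    path = Path("/sys/devices/system/cpu") / name
    return parse_cpu_list(path.read_text()) if path.exists() else []


if __name__ == "__main__":
    online = read_cpu_list("online")
    offline = read_cpu_list("offline")
    print(f"online: {len(online)}, offline: {len(offline)}")
```

Fed the lists from the first lscpu output, parse_cpu_list("0-253") and parse_cpu_list("254-507") each expand to 254 CPUs, which matches the "only half are online" observation.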
 
I don't get it: if you're dedicating the whole CPU landscape to one VM, why not run bare metal? What do you need Proxmox for in this case?
 
If you start the VM from the CLI (qm start VMID), do you get any message or warning? Are there any logs in the system related to this?

I remember a bug report in Ubuntu where the maximum number of CPUs for a VM was 288 [1]. It seems to be fixed there, but I don't know whether it is fixed upstream. I've been unable to find a bug report for QEMU. Looking at the sources, it seems that PVE uses the same 288 CPU maximum [2] (I might not be looking at the right place; I'm not a developer). I filed a bug report asking about this [3].

@Kingneutron Sometimes you may need to use "the whole machine" for the VM and still be able to take advantage of virtualization: live migrations, backups, life cycle management... There may be other approaches, like running more instances of the app in smaller VMs, but sometimes it is just too hard to get that right. @ITxD I'm curious about this exact use case too!

[1] https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2012763
[2] https://git.proxmox.com/?p=mirror_q...c18d7de54c112281f74c70c2a699e150;hb=HEAD#l424
[3] https://bugzilla.proxmox.com/show_bug.cgi?id=5671
 
Hi,
do you have x2APIC enabled in your BIOS and not turned off via the kernel command line? If I remember correctly, this is required for AMD CPUs with > 255 threads when a guest is assigned more vCPUs (thanks @mira for the info).
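
A minimal Python sketch of the software side of that check, offered as an illustration rather than an official procedure. It assumes a Linux system with /proc/cmdline and /proc/cpuinfo, looks for the nox2apic kernel parameter, and checks whether the CPU advertises the x2apic flag. The function names are hypothetical, and the BIOS setting itself cannot be read this way, so that part still has to be checked in the firmware setup.

```python
from pathlib import Path


def x2apic_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line contains the `nox2apic` parameter."""
    return "nox2apic" in cmdline.split()


def cpu_has_x2apic(cpuinfo: str) -> bool:
    """True if any `flags` line in /proc/cpuinfo text lists `x2apic`."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "x2apic" in value.split():
            return True
    return False


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    cpuinfo = Path("/proc/cpuinfo").read_text()
    print("nox2apic on command line:", x2apic_disabled_on_cmdline(cmdline))
    print("x2apic CPU flag present:", cpu_has_x2apic(cpuinfo))
```

The flag only shows that the CPU (or virtual CPU) supports x2APIC; whether the firmware and kernel actually run in x2APIC mode is usually easier to confirm from the boot log.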
 