Nested ESXi Virtualisation

TecScott

I've seen articles/posts regarding nested ESXi virtualisation but I seem to have an issue with purple screens when writing data to a second hard drive.

If I run the ESXi host on its own it runs without any issues; however, when copying data to the nested host it will randomly purple screen, referencing a PCPU lockup.

PCPU 1 locked up. Failed to ack TLB invalidate (at least 1 locked up, PCPU(s): 1).
PCPU(s) did not respond to NMI. Possible hardware problem; contact hardware vendor.

The local debugger after the PSOD shows SCSI aborts:

scsiTaskMgmtCommand:VMK Task: ABORT sn=0x8bddb initiator=0x43024ff900
ahciAbortIO: (curr) HWQD: 4 BusyL: 0 PioL: 0
scsiTaskMgmtCommand:VMK Task VIRT_RESET initiator=0x43024ff900
ahciAbortIO: (curr) HWQD: 4 BusyL: 0 PioL: 0
'Shared': HB at offset 3866624 - Waiting for timed out HB:
[HB state abcdef02 offset 3866624 gen 103 stampUS....
nmp_ThrottleLogForDevice:3863: Cmd 0x2a (0x459a4259a0c0, 2097165) to dev "t10.ATA___QEMU_HARDDISK__________QM00015_________" on path "vmhba1:C0:T1:L0" Failed:
nmp_ThrottleLogForDevice:3872: H:0x5 D:0x22 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL. cmdId.initiator=0x403024ff900 CmdSN 0x8bddb
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA___QEMU_HARDDISK_______QM00015_______" state in doubt; requested fast path state update...
.... CmdSN 0x8bddb from world 2097165 to dev "t10.ATA____QEMU_HARDDISK________QM00015______" failed
After this the PCPUs no longer perform a heartbeat, which results in the VM crashing.

Anyone encountered a similar issue with nested ESXi?

CPU is set to host, the machine type to q35, BIOS to SeaBIOS, and OS type to Linux 5.x - 2.6 Kernel. For the SCSI controller I've tried both VMware PVSCSI and LSI 53C895A.
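
For reference, nested ESXi also requires nested virtualisation to be enabled on the Proxmox host so that the host CPU type actually exposes VT-x to the guest. A minimal check/enable sketch, assuming an Intel host (use kvm_amd and nested=1 on AMD):

Code:
# check whether nested virtualisation is enabled on the Proxmox host
cat /sys/module/kvm_intel/parameters/nested     # should print Y or 1

# if not, enable it persistently and reload the module (with no VMs running)
echo "options kvm-intel nested=Y" > /etc/modprobe.d/kvm-intel.conf
modprobe -r kvm_intel && modprobe kvm_intel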
 
I run nested ESXi with the vHDD attached via SATA, not SCSI. Have you tried that?
Thanks - it is attached via SATA, unfortunately. When attached via SCSI the disk doesn't appear in ESXi at all.

Current settings:
4 GB RAM, 4 cores, host CPU, SeaBIOS, i440fx (tried q35), VMware PVSCSI SCSI controller (tried the default), SATA disk (tried SCSI), vmxnet3 NICs, Other OS type (tried Linux 5.x - 2.6 Kernel).

The server will run fine idle, but when copying data to it, it'll hit the issue and PSOD. I've also tried rebuilding it and creating a new VMFS datastore, which made no difference.
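
For reference, a minimal sketch of the settings described above as qm commands; the VMID (100) and the storage name (local-lvm) are placeholders, adjust them to your setup:

Code:
qm set 100 --memory 4096 --cores 4 --cpu host
qm set 100 --bios seabios                  # i440fx is the default machine type
qm set 100 --scsihw pvscsi                 # SCSI controller presented to ESXi
qm set 100 --sata0 local-lvm:32            # disk attached via SATA, 32 GB
qm set 100 --net0 vmxnet3,bridge=vmbr0
qm set 100 --ostype other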
 
What about the CPU settings? I use this and it has been working for some time now: a purple screen every month or so, but no disk I/O issues:

[screenshot: VM CPU settings]
 
I have more or less the same config, but I get the PSOD regularly, every day. It looks like it happens mostly when the systems are idle enough.
Is there anything I can test to figure out what's causing it?
[screenshot: VM configuration]

I've tried it with multiple hardware and systems supported in the compatibility matrix, but it's always the same.
This is my hardware config:

[screenshot: hardware configuration]
 
Did you try using a UEFI BIOS on the VM?

Also, how much RAM and how many CPU cores does the host have? What else is running on the host?
 
No, I didn't try UEFI. That's definitely something I can try, but to be honest I doubt it will change anything. Still worth a try.
Concerning RAM/CPU cores: I've tried up to 24 cores and 64 GB of memory, but nothing changed the behaviour.
 
Hey, just FYI I get this on another distro on KVM, same scenario.
The ESXi 8 VM has 16 cores / 256 GB, the nested VM 8 cores / 16 GB. I can reproduce it at the moment while installing Oracle Linux 8.9.
I'll post an update if I find a solution.
Originally I thought I had worked around it by picking more stable timers and basic vCPU pinning, but it's not completely over. Host has ... many ... cores and RAM. No other VM on the host has any relevant load impact. Disk is NVMe backed.
It is absolutely a timing/scheduling issue, it's just not clear what exactly the culprit is. Since you also tried i440fx instead of q35, that's already excluded...

It could be that pinning the guest vCPUs inside VMware would work around it, but I forgot how to do that, so that's not going to help here.

This is the relevant timing stuff.

Code:
  <features>
    <acpi/>
    <apic/>
    <pmu state="off"/>
    <vmport state="off"/>
    <kvm>
      <hidden state="on"/>
    </kvm>
  </features>
  <cpu mode="host-passthrough" check="none" migratable="off">
    <topology sockets="1" dies="1" clusters="1" cores="8" threads="2"/>
    <feature policy="require" name="topoext"/>
    <feature policy="require" name="tsc-deadline"/>
  </cpu>
  <clock offset="utc">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
  </clock>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <vcpu placement="static">16</vcpu>
  <iothreads>1</iothreads>
 
I have it down to four potential 'fixes', ordered by guessed probability:
  1. hugepages (seems needed, plausible cause; see the sketch below)
  2. VMware SATA driver issues vs. KVM q35 SATA (switch ESXi to an NFS datastore, HDD and SSD backed)
  3. clean/careful CPU NUMA pinning (done: #cores, #threads, #memory from one node)
  4. static vCPU placement without isolated CPU groups being detrimental (but the idea that Linux ends up scheduling stuff exactly on those cores with 100+ others free makes no sense)
Still need to do more timer testing afterwards (invtsc, set TSC frequency).
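
A minimal sketch of the host-side hugepage reservation, assuming 2 MiB pages and the 128 GiB guest bound to NUMA node 3 from the template further down (the <memoryBacking> block in that template covers the guest side):

Code:
# reserve 65536 x 2 MiB pages (= 128 GiB) on the NUMA node the guest memory is bound to
echo 65536 > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages

# verify the reservation on that node
grep -i huge /sys/devices/system/node/node3/meminfo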
 
OK, I've got an ESXi VM that no longer crashes. I was really worried with regard to SATA, but for all I can see my current one works with SATA virtual disks and still manages not to crash.
So I very much think it's hugepages, or less likely NUMA and pinning.

I/O throughput in the guest is not great but "reasonable":
Code:
[root@localhost ~]# dd if=/dev/sda of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.98394 s, 541 MB/s

I got around 1 GB/s on emulated Hyper-V and XCP-NG, so there might be something one can still do with iothreads etc., but I'm really not keen on doing the added test rounds for that in my free time.
I also tried things like a yum upgrade (1100-ish packages) and it just isn't slow; it feels OK.
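
For what it's worth, libvirt iothreads only attach to virtio-blk disks or a virtio-scsi controller, not to the emulated SATA/AHCI bus used here (and ESXi has no virtio storage driver), so the headroom is probably limited. A sketch of what the assignment would look like for a guest that can use virtio-scsi; the disk path and device name are placeholders:

Code:
<iothreads>1</iothreads>
<devices>
    <controller type='scsi' index='0' model='virtio-scsi'>
        <!-- run this controller's I/O in a dedicated iothread -->
        <driver queues='1' iothread='1'/>
    </controller>
    <disk type='file' device='disk'>
        <driver name='qemu' type='raw' cache='none' io='native'/>
        <source file='/path/to/disk.img'/>
        <target dev='sdc' bus='scsi'/>
    </disk>
</devices>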

Here's my template from OpenNebula; I suppose it is very easy to replicate in Proxmox, so don't let the irrelevant parts distract you.


Code:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    <name>one-448</name>
    <title>esxi-restore</title>
    <uuid>1f817298-24fc-4c4e-88fe-614e711671a3</uuid>
    <vcpu><![CDATA[16]]></vcpu>
    <cputune>
        <shares>16384</shares>
        <vcpupin vcpu='0' cpuset='72'/>
        <vcpupin vcpu='1' cpuset='168'/>
        <vcpupin vcpu='2' cpuset='73'/>
        <vcpupin vcpu='3' cpuset='169'/>
        <vcpupin vcpu='4' cpuset='74'/>
        <vcpupin vcpu='5' cpuset='170'/>
        <vcpupin vcpu='6' cpuset='75'/>
        <vcpupin vcpu='7' cpuset='171'/>
        <vcpupin vcpu='8' cpuset='76'/>
        <vcpupin vcpu='9' cpuset='172'/>
        <vcpupin vcpu='10' cpuset='77'/>
        <vcpupin vcpu='11' cpuset='173'/>
        <vcpupin vcpu='12' cpuset='78'/>
        <vcpupin vcpu='13' cpuset='174'/>
        <vcpupin vcpu='14' cpuset='79'/>
        <vcpupin vcpu='15' cpuset='175'/>
        <emulatorpin cpuset='72,168,73,169,74,170,75,171,76,172,77,173,78,174,79,175'/>
    </cputune>
    <memory>134217728</memory>
    <os>
        <type arch='x86_64' machine='pc-q35-10.0'>hvm</type>
        <loader readonly="yes" type="pflash" secure="no">/usr/share/OVMF/OVMF_CODE.fd</loader>
        <nvram>/var/lib/one//datastores/0/448/esxi-restore_VARS.fd</nvram>
    </os>
    <pm>
        <suspend-to-disk enabled="no"/>
        <suspend-to-mem enabled="no"/>
    </pm>
    <cpu mode='host-passthrough'>
        <topology sockets='1' cores='8' threads='2'/>
        <numa>
            <cell id='0' memory='134217728' cpus='0-15'/>
        </numa>
    </cpu>
    <numatune>
        <memnode cellid='0' mode='strict' nodeset='3'/>
        <memory mode='strict' nodeset='3'/>
    </numatune>
    <memoryBacking>
        <hugepages>
            <page size='2048'/>
        </hugepages>
    </memoryBacking>
    <devices>
        <emulator><![CDATA[/opt/components/qemu/bin/qemu-system]]></emulator>
        <disk type='file' device='disk'>
            <source file='/var/lib/one//datastores/0/448/disk.0'/>
            <target dev='sda' bus='sata'/>
            <boot order='1'/>
            <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
            <address type='drive' controller='0' bus='0' target='0' unit='0'/>
        </disk>
        <disk type='file' device='disk'>
            <source file='/var/lib/one//datastores/0/448/disk.1'/>
            <target dev='sdb' bus='sata'/>
            <driver name='qemu' type='raw' cache='unsafe' discard='unmap'/>
            <address type='drive' controller='1' bus='0' target='0' unit='0'/>
        </disk>
        <interface type='bridge'>
            <virtualport type='openvswitch'/>
            <source bridge='ovsbr0'/>
            <mac address='02:00:b4:3d:b6:3d'/>
            <target dev='one-448-0'/>
            <model type='vmxnet3'/>
        </interface>
        <graphics type='vnc' listen='0.0.0.0' port='10448'/>
    </devices>
    <features>
        <acpi/>
    </features>
    <devices>
        <channel type='unix'>
            <source mode='bind'/><target type='virtio' name='org.qemu.guest_agent.0'/>
        </channel>
    </devices>
    <devices>
        <controller type='scsi' index='0' model='virtio-scsi'>
            <driver queues='1'/>
        </controller>
    </devices>
    <pm>
        <suspend-to-mem enabled="no"/>
        <suspend-to-disk enabled="no"/>
    </pm>
    <devices>
        <memballoon model="none"/>
        <rng model="virtio">
            <backend model="random">/dev/urandom</backend>
            <address type="pci" domain="0x0000" bus="0x00" slot="0x0b" function="0x0"/>
        </rng>
    </devices>
    <metadata>
        <one:vm xmlns:one="http://opennebula.org/xmlns/libvirt/1.0">
            <one:system_datastore><![CDATA[/var/lib/one//datastores/0/448]]></one:system_datastore>
            <one:name><![CDATA[esxi-restore]]></one:name>
            <one:uname><![CDATA[admin]]></one:uname>
            <one:uid>2</one:uid>
            <one:gname><![CDATA[oneadmin]]></one:gname>
            <one:gid>0</one:gid>
            <one:opennebula_version>6.2.0</one:opennebula_version>
            <one:stime>1762986276</one:stime>
            <one:deployment_time>1763076725</one:deployment_time>
        </one:vm>
    </metadata>
</domain>

If someone tries and it doesn't work, I can try your settings as a cross reference.
Tuning-wise I can see I'm missing cache-mode passthrough and such. One thing that matters with NUMA is that it really passes through the hyperthreaded cores, so ESXi knows it's not getting 16 "full" cores but that they share some components. This could certainly make scheduling easier on _both_ Linux/KVM and ESXi (sketch below).
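
A minimal sketch of that <cpu> block with cache passthrough added, keeping the same hyperthreaded topology as the template above:

Code:
<cpu mode='host-passthrough' check='none'>
    <!-- expose the host cache topology instead of a generic emulated one -->
    <cache mode='passthrough'/>
    <!-- threads='2' tells ESXi it gets 8 physical cores with SMT, not 16 full cores -->
    <topology sockets='1' cores='8' threads='2'/>
    <numa>
        <cell id='0' memory='134217728' cpus='0-15'/>
    </numa>
</cpu>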
 