Large scratch regions on KVM with hw-interrupts fail with EEXIST due to APIC access page overlap #1389

@ppenna

Summary

See nanvix/nanvix#2082 for more context on why this is relevant for Nanvix on Hyperlight.

When a guest configures a large scratch region via SandboxConfiguration::set_scratch_size(), HyperlightVm::new() fails on KVM with EEXIST (Error(17)):

UpdateRegion(MapMemory(Hypervisor(KvmError(Error(17)))))

The root cause is that the scratch memory slot (KVM slot 1) overlaps with an internal KVM memory slot created by create_irq_chip() for the LAPIC/APIC access page at GPA 0xFEE00000.

Details

Scratch region placement

Hyperlight places the scratch region at the top of the 32-bit GPA space via scratch_base_gpa():

// hyperlight_common::layout
pub fn scratch_base_gpa(size: usize) -> u64 {
    (MAX_GPA - size + 1) as u64
}

With MAX_GPA = 0xFFFF_FFFF, a scratch size of e.g. 0x6882000 (~104 MB) yields:

  • scratch_base = 0xF977E000
  • Scratch KVM slot covers GPA range [0xF977E000, 0xFFFFFFFF]
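The placement arithmetic above can be checked with a small standalone sketch. It mirrors the quoted formula but is not the Hyperlight source: it takes `u64` instead of `usize` so the arithmetic is the same on any host, and `scratch_range` is a hypothetical helper added only for illustration.

```rust
/// Top of the 32-bit guest physical address space (value quoted in this issue).
const MAX_GPA: u64 = 0xFFFF_FFFF;

/// Same formula as `scratch_base_gpa()` above, written with u64 arithmetic.
/// Assumes `size` is at most MAX_GPA + 1; larger values would underflow.
fn scratch_base_gpa(size: u64) -> u64 {
    MAX_GPA - size + 1
}

/// Hypothetical helper: inclusive GPA range [base, end] of the scratch slot.
fn scratch_range(size: u64) -> (u64, u64) {
    (scratch_base_gpa(size), MAX_GPA)
}
```

For the size used in this report, `scratch_base_gpa(0x688_2000)` evaluates to `0xF977_E000`, matching the base listed above.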

KVM irqchip APIC access page

When the hw-interrupts feature is enabled, KvmVm::new() calls create_irq_chip(). On Intel hardware with APICv (or on AMD with AVIC), KVM automatically creates an internal APIC access page at GPA 0xFEE00000. This is a non-removable, non-movable memory slot managed internally by KVM.

Since 0xFEE00000 falls inside [0xF977E000, 0xFFFFFFFF], KVM rejects the set_user_memory_region call for the scratch slot with EEXIST — the two regions overlap.
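The overlap condition can be expressed as a plain interval check. The sketch below is illustrative only: `ranges_overlap` and `scratch_overlaps_apic` are hypothetical names, and the 4 KiB size of the APIC access page is an assumption that this issue does not state.

```rust
/// Top of the 32-bit guest physical address space (value quoted in this issue).
const MAX_GPA: u64 = 0xFFFF_FFFF;
/// GPA of the APIC access page (value quoted in this issue).
const APIC_BASE_GPA: u64 = 0xFEE0_0000;
/// Assumption: the APIC access page is a single 4 KiB page.
const APIC_PAGE_SIZE: u64 = 0x1000;

/// True if two inclusive ranges (start, end) share at least one address.
fn ranges_overlap(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// True if a scratch region of `scratch_size` bytes, placed at the top of
/// the GPA space, would cover any part of the APIC access page.
/// Assumes `scratch_size` is at most MAX_GPA + 1.
fn scratch_overlaps_apic(scratch_size: u64) -> bool {
    let scratch = (MAX_GPA - scratch_size + 1, MAX_GPA);
    let apic = (APIC_BASE_GPA, APIC_BASE_GPA + APIC_PAGE_SIZE - 1);
    ranges_overlap(scratch, apic)
}
```

With the ~104 MB size from this report the check returns `true`; with the 0x48000 default it returns `false`.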

Maximum safe scratch size

To avoid the APIC page, the scratch size must stay below the following bound:

max_scratch = MAX_GPA - 0xFEE00000 = 0x11FFFFF ≈ 18 MB

Any set_scratch_size() value above ~18 MB will fail on KVM with hw-interrupts enabled on Intel (APICv) or AMD (AVIC) hosts.
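As a rough sketch of this arithmetic, the code below computes the bound quoted above and, separately, the largest page-aligned size that stays clear of the APIC page. The second value relies on two assumptions not stated in this issue: that scratch sizes are multiples of 4 KiB and that the APIC access page occupies exactly one 4 KiB page. Function names are hypothetical.

```rust
/// Top of the 32-bit guest physical address space (value quoted in this issue).
const MAX_GPA: u64 = 0xFFFF_FFFF;
/// GPA of the APIC access page (value quoted in this issue).
const APIC_BASE_GPA: u64 = 0xFEE0_0000;
/// Assumption: 4 KiB pages, and the APIC access page is one such page.
const PAGE_SIZE: u64 = 0x1000;

/// The bound as written in this issue: MAX_GPA - 0xFEE00000.
fn max_scratch_bound() -> u64 {
    MAX_GPA - APIC_BASE_GPA
}

/// Largest page-aligned scratch size whose base is the first page after
/// the APIC access page (under the assumptions above).
fn max_page_aligned_scratch() -> u64 {
    let first_free_gpa = APIC_BASE_GPA + PAGE_SIZE;
    MAX_GPA - first_free_gpa + 1
}
```

Under these assumptions the page-aligned limit is `0x11FF000`, one page short of 18 MiB, which is consistent with the ~18 MB figure.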

Why this does not affect Windows WHP

On Windows, WhpVm::new() does not create an explicit interrupt controller memory slot at a fixed GPA. The WHP API (WHvMapGpaRange2) maps guest physical address ranges independently, and the platform's LAPIC emulation does not reserve a GPA slot that conflicts with user-mapped regions. This is why the same scratch configuration works on Windows.

Why this does not affect small scratch sizes

The default scratch size (DEFAULT_SCRATCH_SIZE = 0x48000 = 288 KB) places scratch at 0xFFFB8000, which is above 0xFEE00000, so there is no overlap.

Reproduction

This can be reproduced with the Nanvix project, which uses Hyperlight as a VMM backend:

  1. Clone the repository and check out branch enhancement-uservm-hyperlight at commit b9c50ed28 (uses Hyperlight rev 4b57b84):

    git clone https://github.com/nanvix/nanvix.git
    cd nanvix
    git checkout enhancement-uservm-hyperlight
  2. Build with the Hyperlight machine target:

    ./z build -- all MACHINE=hyperlight DEPLOYMENT_MODE=standalone
  3. Run the integration test on a machine with KVM and APICv enabled (Intel bare-metal):

    ./bin/mkimage.elf -o nanvix.img \
        "bin/procd.elf;procd" \
        "bin/memd.elf;memd" \
        "bin/testd.elf;testd"
    bash scripts/run-nanvixd.sh hyperlight nanvix.img 120 \
        --wait-for-string "hello, world!"

    On bare-metal Intel with APICv, nanvixd fails immediately with the EEXIST error. On machines with APICv disabled (e.g., WSL2) or on Windows WHP, it succeeds.

    The failing CI run: https://github.com/nanvix/nanvix/actions/runs/24616269717

Proposed solutions

  1. Split the scratch KVM slot around the APIC page: When creating the scratch memory mapping on KVM with hw-interrupts, detect whether [scratch_base, scratch_end] contains 0xFEE00000 and split it into two KVM memory slots: [scratch_base, 0xFEDFFFFF] and [0xFEF00000, scratch_end]. The APIC page itself (0xFEE00000–0xFEEFFFFF) would be left for KVM's internal slot. The host-side mmap backing would remain contiguous; only the KVM slot registration would be split.

  2. Validate scratch_size against known reserved GPAs: In SandboxMemoryLayout::new(), reject scratch sizes that would cause scratch_base_gpa() to fall below 0xFEE00000 when hw-interrupts is enabled on KVM. This would at least provide a clear error message instead of an opaque KvmError(Error(17)).

  3. Document the maximum scratch size constraint: Add a note to SandboxConfiguration::set_scratch_size() and DEFAULT_SCRATCH_SIZE explaining the ~18 MB upper bound on KVM with hw-interrupts.

  4. Disable APICv on the host: Consumers running KVM on Intel can work around this by disabling APICv (for example, by reloading the module with sudo modprobe kvm_intel enable_apicv=0), which prevents KVM from allocating the APIC access page. This eliminates the overlap but comes at a performance cost: APIC accesses fall back to VM-exit based emulation instead of hardware-accelerated handling. This is a viable short-term workaround and does not require any Hyperlight changes.

Option 1 would be the most flexible, allowing scratch regions of arbitrary size on all platforms. Options 2 and 3 are simpler but limit the usable scratch space. Option 4 is a host-side workaround that does not require any Hyperlight changes.
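A minimal sketch of the range computation behind option 1 follows. It only works out which GPA ranges would be registered as separate KVM slots; it does not call KVM and is not Hyperlight code. `GpaRange` and `split_around_apic` are hypothetical names, and the reserved window uses the 0xFEE00000–0xFEEFFFFF bounds proposed in option 1.

```rust
/// Reserved window left for KVM's internal slot (bounds from option 1).
const APIC_HOLE_START: u64 = 0xFEE0_0000;
const APIC_HOLE_END: u64 = 0xFEEF_FFFF;

/// Inclusive guest physical address range [start, end].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GpaRange {
    start: u64,
    end: u64,
}

/// Returns the parts of `scratch` that lie outside the reserved window.
/// Yields one range if there is no overlap, up to two if the window is
/// inside the scratch range, and none if the window covers it entirely.
fn split_around_apic(scratch: GpaRange) -> Vec<GpaRange> {
    // No overlap: the scratch region can be registered as a single slot.
    if scratch.end < APIC_HOLE_START || scratch.start > APIC_HOLE_END {
        return vec![scratch];
    }
    let mut slots = Vec::new();
    // Part below the reserved window, if any.
    if scratch.start < APIC_HOLE_START {
        slots.push(GpaRange { start: scratch.start, end: APIC_HOLE_START - 1 });
    }
    // Part above the reserved window, if any.
    if scratch.end > APIC_HOLE_END {
        slots.push(GpaRange { start: APIC_HOLE_END + 1, end: scratch.end });
    }
    slots
}
```

Because the host-side mapping stays contiguous, each resulting slot would presumably point at the host base address plus `slot.start - scratch.start`, so the bytes behind the reserved window are simply never exposed to the guest.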

Environment

  • Hyperlight revision: 4b57b8416114c489083922afa3dd9716127278fb
  • Features: kvm, hw-interrupts, nanvix-unstable, executable_heap
  • Host: bare-metal Intel x86_64, Linux, KVM with APICv enabled
  • Fails: Intel bare-metal runners (prometheus28, prometheus30, prometheus43)
  • Works: WSL2 (APICv disabled), Windows 11 WHP

Metadata

  • Assignees: No one assigned
  • Labels: lifecycle/confirmed (Bug is verified or proposal seems reasonable)
  • Status: Done