CI: Add eBPF plugin validation across multiple kernel versions #2205

@SRodi

Problem

Retina's eBPF plugins (dropreason, packetforward, dns) are tightly coupled to kernel internals — function signatures, tracepoint layouts, and BTF availability can change across kernel releases. Issue #1906 demonstrated this: inet_csk_accept changed its signature in Linux 6.10-rc1, silently breaking the dropreason plugin on newer kernels.

Today there is no automated CI coverage for kernels beyond what AKS ships (currently 6.6 LTS on AzureLinux 3, 6.8 on Ubuntu 24.04). This means breakage on newer kernels is only discovered manually and after the fact.

Proposal

Add a CI job (or scheduled workflow) that validates eBPF plugin loading and basic metric collection across a matrix of kernel versions. Specifically:

Kernel matrix

At minimum, cover the following kernel families:

| Kernel | Source | Rationale |
| --- | --- | --- |
| 5.15 LTS | Ubuntu 22.04 | Oldest supported LTS baseline |
| 6.1 LTS | AzureLinux 2 / Debian 12 | Current AzureLinux 2 kernel |
| 6.6 LTS | AzureLinux 3 | Current AzureLinux 3 kernel |
| 6.8 | Ubuntu 24.04 | Current AKS Ubuntu kernel |
| 6.10+ | Ubuntu 24.04 HWE | First kernel with inet_csk_accept signature change |
| Latest stable | kernel.org | Catch upcoming breakage early |
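To make the matrix concrete, here is a minimal sketch of how a CI helper might encode it and decide which inet_csk_accept signature to expect on a given kernel. The names `KernelTarget`, `KERNEL_MATRIX`, `kernel_series`, and `expects_new_inet_csk_accept` are hypothetical and not part of Retina. The only fact taken from this issue is that the signature changed in 6.10-rc1. "Latest stable" is left out of the static list because it is a moving target.

```python
import re
from typing import NamedTuple


class KernelTarget(NamedTuple):
    """One row of the proposed kernel matrix (hypothetical structure)."""
    series: str   # major.minor, e.g. "6.6"
    source: str   # distro or image providing this kernel
    lts: bool     # whether the row is marked LTS in the table


# Static rows from the table above.
KERNEL_MATRIX = [
    KernelTarget("5.15", "Ubuntu 22.04", True),
    KernelTarget("6.1", "AzureLinux 2 / Debian 12", True),
    KernelTarget("6.6", "AzureLinux 3", True),
    KernelTarget("6.8", "Ubuntu 24.04", False),
    KernelTarget("6.10", "Ubuntu 24.04 HWE", False),
]


def kernel_series(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as `uname -r` output."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def expects_new_inet_csk_accept(release: str) -> bool:
    """True for kernels at or after 6.10, where the signature changed."""
    return kernel_series(release) >= (6, 10)


for target in KERNEL_MATRIX:
    print(target.series, expects_new_inet_csk_accept(target.series))
```

Comparing integer tuples rather than strings matters here: as plain strings, "6.10" sorts before "6.8", which would put the 6.10 row on the wrong side of the signature change.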

Approach

We could use kind on a host running the target kernel (kind nodes share the host's kernel), deploy Retina via Helm, and validate that metrics are collected. This pattern can be generalized:

  1. GitHub Actions matrix job using VMs or containers with different kernels. Options include:

    • cilium/little-vm-helper — lightweight QEMU-based kernel testing (used by Cilium for similar eBPF CI)
    • Self-hosted runners with specific OS images
    • Azure VMs with HWE kernels (as in validate-dropreason-azure-vm-6.10.sh)
  2. Validation checks per kernel (not just dropreason):

    • All eBPF programs load without verifier errors
    • Metrics endpoint exposes expected metric families (networkobservability_drop_count, networkobservability_forward_count, etc.)
    • No unexpected errors in agent logs
  3. Scheduled + PR-triggered:

    • Run the full kernel matrix on a schedule (e.g., nightly or weekly)
    • On PRs that touch pkg/plugin/*/_cprog/, run at least the LTS kernels + latest stable
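The metric-family check in step 2 could look roughly like the sketch below. It assumes the agent's metrics endpoint returns the Prometheus text exposition format and that the scrape output is already available as a string. How it is fetched (port, path, port-forwarding) depends on the deployment and is left out. The function names and the sample labels are hypothetical. The two metric names are the ones listed in this issue.

```python
import re

# Metric families named in this issue; extend as plugins are added.
EXPECTED_FAMILIES = {
    "networkobservability_drop_count",
    "networkobservability_forward_count",
}

# A sample line starts with the metric name, followed by "{labels}" or a space.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(scrape_text: str) -> set[str]:
    """Collect metric names from sample lines, ignoring comments and blanks."""
    names = set()
    for raw_line in scrape_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" / "# TYPE" lines carry no samples
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_families(scrape_text: str, expected=EXPECTED_FAMILIES) -> set[str]:
    """Return expected families that have no sample in the scrape output."""
    return set(expected) - exposed_metric_names(scrape_text)


# Illustrative scrape output; the label names and values are made up.
SAMPLE = """\
# HELP networkobservability_drop_count Total dropped packets
# TYPE networkobservability_drop_count gauge
networkobservability_drop_count{direction="ingress",reason="EXAMPLE"} 3
"""

print(sorted(missing_families(SAMPLE)))
```

A per-kernel job would fail when `missing_families` returns a non-empty set. Requiring at least one sample, not just a `# TYPE` line, catches the case where a plugin registers its metric but its eBPF program never loaded.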

Stretch goals

  • BTF compatibility checks — validate that CO-RE relocations succeed on each target kernel's BTF
  • Kernel release tracking — automated alerts when a new stable kernel is tagged that hasn't been tested yet
  • Performance regression — compare eBPF program instruction counts across kernels to catch verifier complexity regressions
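For the kernel release tracking goal, one possible shape is a small function that compares upstream release tags against the series already in the matrix. This sketch assumes tags in the `v6.11` / `v6.11.3` style used by the upstream kernel tree and takes the tag list as input instead of fetching it. The function name and the inputs are hypothetical.

```python
import re

# Matches final release tags such as "v6.11" or "v6.11.3".
# Release candidates such as "v6.12-rc1" do not match and are skipped.
_RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)(?:\.\d+)?$")


def untested_series(tags, tested):
    """Return sorted (major, minor) series newer than the newest tested one.

    tags:   iterable of tag names, e.g. ["v6.11", "v6.11.3"]
    tested: non-empty set of (major, minor) tuples already covered by CI
    """
    newest_tested = max(tested)
    newer = set()
    for tag in tags:
        match = _RELEASE_TAG.match(tag)
        if match is None:
            continue
        series = (int(match.group(1)), int(match.group(2)))
        if series > newest_tested:
            newer.add(series)
    return sorted(newer)


# Series taken from the kernel matrix in this issue.
TESTED = {(5, 15), (6, 1), (6, 6), (6, 8), (6, 10)}
# Example tag list; in practice this would come from the upstream repository.
TAGS = ["v6.10.14", "v6.11", "v6.11.3", "v6.12-rc1"]

print(untested_series(TAGS, TESTED))  # [(6, 11)]
```

A scheduled workflow could open an issue or post an alert whenever the returned list is non-empty.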
