
Hubble metric identity resolution incomplete for remote (cross-node) pod endpoints #2182

@pesarkhobeee

Description

When running Retina with Hubble metrics enabled on GKE (without Cilium as the CNI), the source and destination labels in Hubble metrics (e.g., hubble_tcp_flags_total, hubble_flows_processed_total, hubble_dns_queries_total) are only populated for pods local to the Retina agent's node. Remote pods (on other nodes) get source="" or destination="", even though their CiliumIdentity and CiliumEndpoint CRDs exist with the correct labels.

The source_namespace / destination_namespace labels resolve correctly for all pods (local and remote) because they come from the cluster-wide CiliumIdentity CRDs. But app-level identity resolution (via sourceEgressContext=app or labelsContext=source_app) fails for remote endpoints.

This means no single Hubble metric series has both source and destination populated for cross-node traffic, making per-service dashboards incomplete.

Environment

  • Retina version: v1.1.0
  • Chart: oci://ghcr.io/microsoft/retina/charts/retina-hubble v1.1.0
  • Kubernetes: GKE (Google Kubernetes Engine), europe-west3
  • CNI: GKE default (not Cilium)
  • Nodes: 35+ nodes (GKE NAP autoscaling)

Hubble metrics configuration

hubble:
  metrics:
    enabled:
      - "flow:sourceEgressContext=app|pod-name;destinationIngressContext=app|pod-name;labelsContext=source_namespace,destination_namespace,source_app,destination_app"
      - "tcp:sourceEgressContext=app|pod-name;destinationIngressContext=app|pod-name;labelsContext=source_namespace,destination_namespace,source_app,destination_app"
      - "drop:sourceEgressContext=app|pod-name;destinationIngressContext=app|pod-name;labelsContext=source_namespace,destination_namespace,source_app,destination_app"
      - "dns:query;sourceEgressContext=app|pod-name;destinationIngressContext=app|pod-name;labelsContext=source_namespace,destination_namespace,source_app,destination_app"
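For context on the config above: the pipe in sourceEgressContext=app|pod-name tells Hubble to try each context source in order and use the first one that resolves. When the agent has neither labels nor a pod name cached for a remote endpoint, every fallback fails and the label is emitted empty. A minimal sketch of that fallback logic (the types and helper names are illustrative, not Retina's actual code):

```go
package main

import "fmt"

// endpoint mimics what the Hubble observer knows about a flow endpoint.
// For a remote pod with incomplete identity resolution, both fields may be empty.
type endpoint struct {
	labels  map[string]string
	podName string
}

// resolveContext applies the "app|pod-name" fallback: use the app label
// if present, otherwise fall back to the pod name, otherwise "".
func resolveContext(ep endpoint) string {
	if app, ok := ep.labels["app.kubernetes.io/name"]; ok {
		return app
	}
	if ep.podName != "" {
		return ep.podName
	}
	return "" // no source resolved: the empty label reported in this issue
}

func main() {
	local := endpoint{labels: map[string]string{"app.kubernetes.io/name": "dispatching"}, podName: "dispatching-abc"}
	remote := endpoint{} // remote pod: nothing in this agent's cache
	fmt.Printf("local=%q remote=%q\n", resolveContext(local), resolveContext(remote))
	// → local="dispatching" remote=""
}
```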

Steps to reproduce

  1. Deploy Retina v1.1.0 with Hubble metrics on a multi-node GKE cluster (no Cilium CNI)
  2. Ensure pods have app.kubernetes.io/name labels
  3. Verify that the CiliumIdentity CRDs contain the correct label:
    kubectl get ciliumidentity <identity-id> -o jsonpath='{.security-labels}' | tr ',' '\n' | grep app
    # Output: "k8s:app.kubernetes.io/name":"my-service"
  4. Scrape Hubble metrics from a Retina agent:
    curl -s http://<agent-pod-ip>:9965/metrics | grep hubble_tcp_flags
  5. Observe that source is populated for local pods but empty for remote pods, and vice versa for destination.

Actual behavior

From retina-agent on Node A (where dispatching runs):
hubble_tcp_flags_total{source="dispatching", source_namespace="consumer-backend", destination="", destination_namespace="core-services", flag="SYN"} 42

From retina-agent on Node B (where pricing-web runs):
hubble_tcp_flags_total{source="", source_namespace="consumer-backend", destination="pricing-web", destination_namespace="core-services", flag="SYN"} 42

Labels that DO resolve for remote pods: source_namespace, destination_namespace
Labels that DON'T resolve for remote pods: source, destination, source_app, destination_app

source_workload / destination_workload (via labelsContext) are always empty for both local and remote pods — the workload-name context also never resolves.

Expected behavior

Both source and destination labels should resolve for all pods in the cluster, since their CiliumIdentity CRDs exist with the correct app.kubernetes.io/name labels. The Retina operator creates these CRDs correctly; the issue is that the Hubble observer's identity cache does not use them for remote endpoint resolution.

Root cause analysis

The Retina operator correctly creates CiliumEndpoint and CiliumIdentity CRDs for all pods. However, the Hubble observer inside each Retina agent appears to only maintain a complete IP → endpoint → identity mapping for local pods. Remote pod IPs are matched to a CiliumIdentity (giving the namespace), but the full identity label lookup (needed for app context resolution) fails.

In upstream Cilium, the Cilium agent maintains a complete cluster-wide identity cache via the kvstore or the CRD-backed identity allocator. Retina's Hubble observer doesn't appear to build an equivalent cluster-wide cache from the CiliumEndpoint/CiliumIdentity CRDs it has access to.
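The asymmetry described above can be sketched as a two-tier lookup: a coarse namespace mapping that is populated cluster-wide, and a full label mapping that is populated for local pods only (field and type names here are illustrative, not Retina's internals):

```go
package main

import "fmt"

// ipCache models the per-agent state the root-cause analysis describes:
// full endpoint labels exist only for IPs local to this node, while a
// coarser namespace mapping is known cluster-wide via CiliumIdentity CRDs.
type ipCache struct {
	localLabels map[string]map[string]string // populated for local pods only
	namespaces  map[string]string            // populated cluster-wide
}

// lookup returns the namespace (always resolvable) and the app label
// (resolvable only when the IP belongs to a pod on this agent's node).
func (c *ipCache) lookup(ip string) (ns, app string) {
	ns = c.namespaces[ip]
	if labels, ok := c.localLabels[ip]; ok {
		app = labels["app.kubernetes.io/name"]
	}
	return ns, app
}

func main() {
	c := &ipCache{
		localLabels: map[string]map[string]string{
			"10.0.1.5": {"app.kubernetes.io/name": "dispatching"}, // local pod
		},
		namespaces: map[string]string{
			"10.0.1.5": "consumer-backend",
			"10.0.2.9": "core-services", // remote pod: namespace known, labels not
		},
	}
	ns, app := c.lookup("10.0.2.9")
	fmt.Printf("remote: namespace=%q app=%q\n", ns, app)
	// → remote: namespace="core-services" app=""
}
```

This reproduces the observed metric shape: namespace labels populated on both sides, app-level labels empty for the remote side.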

Impact

  • Datadog dashboards filtering by service name (source/destination tags) show incomplete data
  • Service dependency tables have empty service name columns for cross-node traffic
  • Per-service TCP error rate calculations are inaccurate (some traffic attributed to source="")
  • Engineers must rely on source_namespace/destination_namespace filtering which is less precise

Workaround

Filter dashboards by namespace instead of service name. This works but doesn't distinguish between multiple services in the same namespace.

Suggested fix

Have the Retina agent's Hubble observer build a cluster-wide identity cache from the CiliumEndpoint and CiliumIdentity CRDs (which the Retina operator already creates), similar to how upstream Cilium's agent populates its identity cache. This would allow sourceEgressContext=app and destinationIngressContext=app to resolve correctly for all pods regardless of which node they run on.
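The core of such a cache is a join between cluster-wide watches on the two CRDs: CiliumEndpoint gives IP → identity ID, CiliumIdentity gives ID → labels. The sketch below uses simplified stand-in structs rather than the actual Cilium API types, and the two handlers would be wired to informer add/update events in a real implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the CRDs the Retina operator already creates.
type ciliumEndpoint struct {
	ip         string
	identityID int64
}
type ciliumIdentity struct {
	id     int64
	labels map[string]string
}

// clusterIdentityCache joins the two CRD watches so that any pod IP in the
// cluster, local or remote, resolves to full identity labels.
type clusterIdentityCache struct {
	mu         sync.RWMutex
	ipToID     map[string]int64
	idToLabels map[int64]map[string]string
}

func newClusterIdentityCache() *clusterIdentityCache {
	return &clusterIdentityCache{
		ipToID:     map[string]int64{},
		idToLabels: map[int64]map[string]string{},
	}
}

// onEndpoint and onIdentity would be informer event handlers.
func (c *clusterIdentityCache) onEndpoint(ep ciliumEndpoint) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ipToID[ep.ip] = ep.identityID
}

func (c *clusterIdentityCache) onIdentity(id ciliumIdentity) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.idToLabels[id.id] = id.labels
}

// labelsForIP resolves a pod IP to its identity labels, or nil if unknown.
func (c *clusterIdentityCache) labelsForIP(ip string) map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.idToLabels[c.ipToID[ip]]
}

func main() {
	cache := newClusterIdentityCache()
	// A remote pod's CRDs, visible from every node in the cluster.
	cache.onIdentity(ciliumIdentity{id: 42, labels: map[string]string{"k8s:app.kubernetes.io/name": "pricing-web"}})
	cache.onEndpoint(ciliumEndpoint{ip: "10.0.2.9", identityID: 42})
	fmt.Println(cache.labelsForIP("10.0.2.9")["k8s:app.kubernetes.io/name"])
	// → pricing-web
}
```

A real implementation would also need delete handling and the genuine Cilium API types, but the join itself is the key piece: once an agent can map any remote IP to identity labels, the app context resolves regardless of which node the pod runs on.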
