fix(helm): drop unresolvable _grpc._tcp. SRV prefix from query-backen…#5232

Open
nissessenap wants to merge 2 commits into grafana:main from nissessenap:fix_5229
@nissessenap

What this PR does

Fixes #5229: in v2 Helm mode, every profile-data read query (SelectMergeStacktraces, SelectSeries) hangs ~30s and returns HTTP 499, so flame graphs render empty. The query-backend component logs nothing — the request never reaches it.

Root cause

The chart rendered:

-query-backend.address=dns:///_grpc._tcp.-headless.$(NAMESPACE_FQDN):9095

This is passed directly to the query-backend client's grpc.NewClient, which uses grpc-go's stock dns resolver. That resolver does an A/AAAA lookup on the literal host; it only does SRV lookups for grpclb (_grpclb._tcp.<host>), and EnableSRVLookups is false by default. _grpc._tcp.<headless> has an SRV record but no A/AAAA record, so the resolver yields zero endpoints. With the client's waitForReady: true service config, calls park until the 30s timeout and then return HTTP 499.
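
The lookup behaviour described above can be modelled with a toy in-memory zone. This is only a sketch of the reasoning, not grpc-go's resolver code: the `zone` type, `resolveLikeStockDNS`, `exampleZone`, the service name `pyroscope-headless.profiling.svc.cluster.local`, and the pod IPs are all hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// zone is a toy stand-in for cluster DNS. Address records and SRV records
// live in separate maps, because a name that only owns an SRV record has
// nothing to return for an A/AAAA query.
type zone struct {
	a   map[string][]string // host -> pod IPs (A records)
	srv map[string][]string // SRV owner name -> "target:port" entries
}

// resolveLikeStockDNS models the behaviour described in the root cause:
// strip the "dns:///" scheme, split host and port, and look up address
// records for the literal host only. SRV records are never consulted.
func (z zone) resolveLikeStockDNS(target string) ([]string, error) {
	endpoint := strings.TrimPrefix(target, "dns:///")
	host, port, err := net.SplitHostPort(endpoint)
	if err != nil {
		return nil, err
	}
	var endpoints []string
	for _, ip := range z.a[host] {
		endpoints = append(endpoints, net.JoinHostPort(ip, port))
	}
	return endpoints, nil
}

// exampleZone builds a hypothetical headless service with two ready pods.
// The service name and IPs are assumptions, not values from the chart.
func exampleZone() zone {
	const svc = "pyroscope-headless.profiling.svc.cluster.local"
	return zone{
		a: map[string][]string{
			svc: {"10.0.0.11", "10.0.0.12"},
		},
		srv: map[string][]string{
			"_grpc._tcp." + svc: {"pod-a." + svc + ":9095", "pod-b." + svc + ":9095"},
		},
	}
}

func main() {
	z := exampleZone()
	targets := []string{
		"dns:///_grpc._tcp.pyroscope-headless.profiling.svc.cluster.local:9095",
		"dns:///pyroscope-headless.profiling.svc.cluster.local:9095",
	}
	for _, target := range targets {
		endpoints, err := z.resolveLikeStockDNS(target)
		fmt.Printf("%s -> %d endpoints %v (err=%v)\n", target, len(endpoints), endpoints, err)
	}
}
```

In this model the `_grpc._tcp.` name yields an empty endpoint list without any error, which matches the symptom: the client has nothing to connect to, so calls wait rather than fail fast.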

Metadata RPCs (ProfileTypes, Series, LabelNames, ...) are unaffected because the metastore client uses its own kubernetes:// discovery resolver, not grpc-go's.

Fix

Drop the _grpc._tcp. prefix so the address resolves the headless service's plain A records:

-query-backend.address=dns:///-headless.$(NAMESPACE_FQDN):9095

The headless service is clusterIP: None, so this resolves all ready pod IPs, and the client's existing round_robin LB policy balances across them.

Microservices considered

Both single-binary v2 and microservices v2 rendered the same broken default, so microservices reads were affected too (just not yet reported). The fix is correct for both: single-binary resolves to one pod, microservices round-robins across N query-backend replicas via the headless A records. Confirmed in the regenerated rendered/*.yaml.

kubernetes:// / dnssrvnoa+ are not options for this address — the query-backend client uses a bare grpc.NewClient with no custom resolver registered (only the metastore client has one).

Backward compatibility

Users of the extraArgs workaround are unaffected: the chart still skips its default when query-backend.address is set in extraArgs, so no duplicate flag is rendered. After upgrading, they can drop the override.
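
The precedence rule above can be sketched as a small function. This models the described behaviour only and is not the chart's template: `renderArgs`, `defaultQueryBackendAddr`, and the addresses used are hypothetical names and values for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultQueryBackendAddr is a hypothetical stand-in for the chart default.
const defaultQueryBackendAddr = "dns:///pyroscope-headless.profiling.svc.cluster.local:9095"

// renderArgs models the precedence described above: the default
// -query-backend.address flag is emitted only when the user has not set
// query-backend.address in extraArgs, so the flag never appears twice.
func renderArgs(extraArgs map[string]string) []string {
	args := make([]string, 0, len(extraArgs)+1)
	if _, overridden := extraArgs["query-backend.address"]; !overridden {
		args = append(args, "-query-backend.address="+defaultQueryBackendAddr)
	}
	// Sort keys so the rendered output is deterministic.
	keys := make([]string, 0, len(extraArgs))
	for k := range extraArgs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "-"+k+"="+extraArgs[k])
	}
	return args
}

func main() {
	// No override: the default is rendered.
	fmt.Println(renderArgs(nil))
	// Existing workaround: only the user's value is rendered.
	fmt.Println(renderArgs(map[string]string{
		"query-backend.address": "dns:///custom-headless.example.svc.cluster.local:9095",
	}))
}
```

Under this model an existing override keeps working after the upgrade, and removing it simply falls back to the corrected default.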

…d.address default

In v2 mode the chart rendered the query-backend client address as
`dns:///_grpc._tcp.<headless>:9095`. That address is handed straight to
grpc-go's stock `dns` resolver, which does an A/AAAA lookup on the literal
host. It only performs SRV lookups for grpclb (`_grpclb._tcp.<host>`), and
`EnableSRVLookups` is false by default. The host `_grpc._tcp.<headless>` has
an SRV record but no A/AAAA record, so the resolver yields zero endpoints.
Combined with the client's `waitForReady: true` service config, every read
RPC (SelectMergeStacktraces, SelectSeries) parks until the 30s call timeout
and returns HTTP 499, while the query-backend logs nothing.

Fixes grafana#5229

Signed-off-by: Edvin Norling <edvin.norling@kognic.com>
@nissessenap nissessenap requested a review from a team as a code owner June 3, 2026 11:03
@cla-assistant

cla-assistant Bot commented Jun 3, 2026

CLA assistant check
All committers have signed the CLA.

Development

Successfully merging this pull request may close these issues.

Helm single-binary v2: default query-backend.address (dns:///_grpc._tcp...) is unresolvable by grpc-go → reads hang 30s / HTTP 499
