[Issue 1473][Consumer] Fix race in grabConn dropping messages before handler registration #1476

Open
aleks-lazic wants to merge 1 commit into apache:master from aleks-lazic:fix/consumer-handler-registration-race-in-grabconn

@aleks-lazic

Motivation

MESSAGE and ACTIVE_CONSUMER_CHANGE frames sent by the broker immediately after a successful subscribe RPC are dropped. The client logs "Consumer not found while active consumer change" and "Got unexpected message", but the frames are permanently lost.

This happens because grabConn() calls AddConsumeHandler after the subscribe RPC returns. The broker starts delivering frames as soon as the subscribe succeeds, but the connection's read goroutine cannot find the handler yet and discards them.
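The drop can be reproduced deterministically with a small model of the connection's handler map. This is only a sketch: `conn`, `frame`, `dispatch`, and `subscribeThenRegister` are hypothetical names for illustration, not the client's real types, and the subscribe RPC itself is elided.

```go
package main

import "sync"

// frame is a hypothetical stand-in for a broker command such as MESSAGE.
type frame struct {
	consumerID uint64
	payload    string
}

// conn models only the handler map of a connection (not the real client type).
type conn struct {
	mu       sync.Mutex
	handlers map[uint64]func(frame)
	dropped  []frame
}

func newConn() *conn {
	return &conn{handlers: make(map[uint64]func(frame))}
}

func (c *conn) addConsumeHandler(id uint64, h func(frame)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.handlers[id] = h
}

// dispatch mimics the read goroutine: look up the handler for the frame's
// consumer ID and discard the frame when none is registered.
func (c *conn) dispatch(f frame) {
	c.mu.Lock()
	h, ok := c.handlers[f.consumerID]
	if !ok {
		c.dropped = append(c.dropped, f)
	}
	c.mu.Unlock()
	if ok {
		h(f)
	}
}

// subscribeThenRegister replays the racy order: the broker pushes a frame
// right after the subscribe succeeds, before the handler is registered.
func subscribeThenRegister() (received, dropped int) {
	c := newConn()
	var got []frame

	// Subscribe RPC has returned successfully (elided); broker delivers at once.
	c.dispatch(frame{consumerID: 1, payload: "m0"})

	// Handler registration happens too late for m0.
	c.addConsumeHandler(1, func(f frame) { got = append(got, f) })
	c.dispatch(frame{consumerID: 1, payload: "m1"})

	return len(got), len(c.dropped)
}
```

Calling `subscribeThenRegister()` in this model yields one received frame and one dropped frame; the dropped one is the frame that arrived between the subscribe response and the registration.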

This is a correctness hazard for consumers using AckCumulative: a later message acknowledged cumulatively can implicitly acknowledge the dropped message before the application ever processes it — permanent silent message loss.
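A minimal sketch of why cumulative acknowledgement turns a dropped frame into lost data. Message IDs are simplified to plain integers here (real IDs are more structured), and `silentlyLost` is a hypothetical helper, not part of the client.

```go
package main

// silentlyLost returns the IDs that a cumulative ack up to ackedUpTo covers
// even though the application never processed them.
// Assumption: IDs are ordered integers, which simplifies real message IDs.
func silentlyLost(sent []int, processed map[int]bool, ackedUpTo int) []int {
	var lost []int
	for _, id := range sent {
		if id <= ackedUpTo && !processed[id] {
			lost = append(lost, id)
		}
	}
	return lost
}
```

For example, if the broker sent messages 0, 1, and 2, the client dropped 0 during the race, and the application then cumulatively acknowledged 2, `silentlyLost([]int{0, 1, 2}, map[int]bool{1: true, 2: true}, 2)` reports message 0 as acknowledged but never processed.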

Modifications

Split RequestWithCnxKeySuffix (which is internally GetConnection + RequestOnCnx) into its two constituent operations and insert AddConsumeHandler in between, so the handler is registered before the broker can send any frames.

On subscribe failure, DeleteConsumeHandler cleans up the pre-registered handler. The timeout path sends CloseConsumer via RequestOnCnx on the same connection.

This mirrors the existing pattern in producer_partition.go which already does GetConnection → RegisterListener → RequestOnCnx.
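The reordered flow and its cleanup paths might look roughly like the sketch below. It is not the PR's code: `fakeConn`, `grabConnSketch`, `errTimeout`, and the string commands are hypothetical stand-ins, the connection lookup and RPC are injected as functions, and the relative order of handler deletion and the close request on timeout is an assumption.

```go
package main

import "errors"

// fakeConn is a hypothetical stand-in for the client's connection type.
type fakeConn struct {
	closed   bool
	handlers map[uint64]bool
}

var (
	errConnClosed = errors.New("connection closed")
	errTimeout    = errors.New("request timed out")
)

func (c *fakeConn) addConsumeHandler(id uint64) error {
	if c.closed {
		return errConnClosed
	}
	c.handlers[id] = true
	return nil
}

func (c *fakeConn) deleteConsumeHandler(id uint64) {
	delete(c.handlers, id)
}

// grabConnSketch registers the handler between acquiring the connection and
// sending the subscribe request, and undoes the registration on failure.
func grabConnSketch(
	id uint64,
	getConn func() (*fakeConn, error),
	requestOnCnx func(c *fakeConn, cmd string) error,
) (*fakeConn, error) {
	c, err := getConn()
	if err != nil {
		return nil, err // early return: nothing was registered
	}
	if err := c.addConsumeHandler(id); err != nil {
		return nil, err // early return: no RPC was sent
	}
	if err := requestOnCnx(c, "SUBSCRIBE"); err != nil {
		// Remove the pre-registered handler so it does not leak.
		c.deleteConsumeHandler(id)
		if errors.Is(err, errTimeout) {
			// The broker may have created the consumer anyway; ask it to
			// close on the same connection (best effort, error ignored).
			_ = requestOnCnx(c, "CLOSE_CONSUMER")
		}
		return nil, err
	}
	return c, nil
}
```

Because the handler is already in the map when the subscribe request goes out, a frame that arrives while the RPC is still in flight finds its consumer in this model.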

Verifying this change

This change added tests and can be verified as follows:

  • TestGrabConn_HandlerRegisteredBeforeSubscribe — handler is in the map before the subscribe RPC
  • TestGrabConn_HandlerRemovedOnSubscribeFailure — no handler leak on error
  • TestGrabConn_HandlerRemovedOnSubscribeTimeout — cleanup on timeout, close sent on same connection
  • TestGrabConn_BrokerFrameDuringSubscribe — broker frame arriving mid-RPC reaches the consumer
  • TestGrabConn_GetConnectionFailure — early return, no handler registered
  • TestGrabConn_AddConsumeHandlerFailure — early return, no RPC sent

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no

Documentation

  • Does this pull request introduce a new feature? no

Copilot AI left a comment

Pull request overview

Fixes a consumer subscribe ordering race where broker frames (notably MESSAGE and ACTIVE_CONSUMER_CHANGE) could arrive immediately after (or during) subscribe and be dropped because the consumer handler wasn’t registered on the connection yet.

Changes:

  • Refactors partitionConsumer.grabConn() to explicitly GetConnection and register the consume handler before issuing the subscribe RPC via RequestOnCnx.
  • Adds cleanup on subscribe RPC failure/timeout (delete handler; send close-on-timeout on the same connection).
  • Adds targeted unit tests around handler registration ordering and cleanup behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pulsar/consumer_partition.go | Reorders connection acquisition / handler registration vs. subscribe RPC to close the handler-registration race. |
| pulsar/consumer_partition_test.go | Adds new grabConn-focused tests with spy connection/RPC client to validate ordering and cleanup paths. |

Comment thread pulsar/consumer_partition.go
Comment thread pulsar/consumer_partition.go
Comment thread pulsar/consumer_partition_test.go
@aleks-lazic aleks-lazic force-pushed the fix/consumer-handler-registration-race-in-grabconn branch from 99e7a1a to 02d1718 Compare April 9, 2026 05:58
@aleks-lazic aleks-lazic force-pushed the fix/consumer-handler-registration-race-in-grabconn branch from 02d1718 to f80ea0d Compare April 9, 2026 12:26
@aleks-lazic
Author

Ready for another review @crossoverJie

@aleks-lazic
Author

aleks-lazic commented Apr 15, 2026

The failing integration tests are unrelated flaky tests:
--- FAIL: TestTokenAuth (30.00s)
--- FAIL: TestTokenAuthWithClientVersion (30.01s)

cc @crossoverJie @RobertIndie Is it possible to get another review on this PR? Thanks.
