[Issue 1473][Consumer] Fix race in grabConn dropping messages before handler registration #1476
Open
aleks-lazic wants to merge 1 commit into apache:master from
Pull request overview
Fixes a consumer subscribe ordering race where broker frames (notably MESSAGE and ACTIVE_CONSUMER_CHANGE) could arrive immediately after (or during) subscribe and be dropped because the consumer handler wasn’t registered on the connection yet.
Changes:
- Refactors `partitionConsumer.grabConn()` to explicitly `GetConnection` and register the consume handler before issuing the subscribe RPC via `RequestOnCnx`.
- Adds cleanup on subscribe RPC failure/timeout (delete handler; send close-on-timeout on the same connection).
- Adds targeted unit tests around handler registration ordering and cleanup behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pulsar/consumer_partition.go | Reorders connection acquisition / handler registration vs. subscribe RPC to close the handler-registration race. |
| pulsar/consumer_partition_test.go | Adds new grabConn-focused tests with spy connection/RPC client to validate ordering and cleanup paths. |
Author
Ready for another review @crossoverJie
Author
Failing integration tests are coming from unrelated flaky tests. cc @crossoverJie @RobertIndie, is it possible to get another review on this PR? Thanks.
Motivation
`MESSAGE` and `ACTIVE_CONSUMER_CHANGE` frames sent by the broker immediately after a successful subscribe RPC are silently dropped. The client logs `Consumer not found while active consumer change` and `Got unexpected message`, but the frames are permanently lost.
This happens because `grabConn()` calls `AddConsumeHandler` after the subscribe RPC returns. The broker starts delivering frames as soon as the subscribe succeeds, but the connection's read goroutine cannot find the handler yet and discards them.
This is a correctness hazard for consumers using `AckCumulative`: a later message acknowledged cumulatively can implicitly acknowledge the dropped message before the application ever processes it, resulting in permanent, silent message loss.
Modifications
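To make the failure mode described under Motivation concrete before the fix itself, here is a small deterministic model of it. Everything in it is an assumption for illustration: `connModel`, `deliver`, and `ackedByCumulative` are hypothetical names, message IDs are reduced to small integers starting at 1, and the concurrent read goroutine is replaced by sequential calls so the ordering is explicit.

```go
package main

import "fmt"

// connModel is a hypothetical, simplified connection; none of these names
// come from pulsar-client-go.
type connModel struct {
	handler func(id int) // nil until the consumer registers its handler
	dropped []int        // frames that arrived with no handler registered
}

// deliver models the read goroutine handing a MESSAGE frame to the consumer.
// With no handler registered, the frame is discarded.
func (c *connModel) deliver(id int) {
	if c.handler == nil {
		c.dropped = append(c.dropped, id)
		return
	}
	c.handler(id)
}

// ackedByCumulative lists every message ID covered by a cumulative
// acknowledgment up to and including upTo (IDs assumed to start at 1).
func ackedByCumulative(upTo int) []int {
	ids := make([]int, 0, upTo)
	for id := 1; id <= upTo; id++ {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	c := &connModel{}
	var processed []int

	// Old ordering: the subscribe has succeeded and the broker sends
	// message 1 before the handler is registered.
	c.deliver(1)
	c.handler = func(id int) { processed = append(processed, id) }
	c.deliver(2)

	// The application cumulatively acknowledges the last message it saw.
	acked := ackedByCumulative(processed[len(processed)-1])

	fmt.Println("processed:", processed)
	fmt.Println("dropped:  ", c.dropped)
	fmt.Println("acked:    ", acked)
}
```

In this model message 1 is never processed, yet the cumulative acknowledgment of message 2 covers it, so the broker would not redeliver it.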
Split `RequestWithCnxKeySuffix` (which is internally `GetConnection` + `RequestOnCnx`) into its two constituent operations and insert `AddConsumeHandler` in between, so the handler is registered before the broker can send any frames.
On subscribe failure, `DeleteConsumeHandler` cleans up the pre-registered handler. The timeout path sends `CloseConsumer` via `RequestOnCnx` on the same connection.
This mirrors the existing pattern in `producer_partition.go`, which already does `GetConnection` → `RegisterListener` → `RequestOnCnx`.
Verifying this change
This change added tests and can be verified as follows:
- `TestGrabConn_HandlerRegisteredBeforeSubscribe`: handler is in the map before the subscribe RPC
- `TestGrabConn_HandlerRemovedOnSubscribeFailure`: no handler leak on error
- `TestGrabConn_HandlerRemovedOnSubscribeTimeout`: cleanup on timeout, close sent on same connection
- `TestGrabConn_BrokerFrameDuringSubscribe`: broker frame arriving mid-RPC reaches the consumer
- `TestGrabConn_GetConnectionFailure`: early return, no handler registered
- `TestGrabConn_AddConsumeHandlerFailure`: early return, no RPC sent
Does this pull request potentially affect one of the following parts:
Documentation