rpc: de-flake Server.OverloadTest{,NoCreateSpecialMessage}#209
Merged
Conversation
`ASSERT_EQ(100, succeeded)` over N racing calls against a 100-slot cap is timing-dependent: the handler slept a fixed 2s, so on a busy/slow runner the first sleepy handlers finish and free slots before all N calls are submitted, letting more than the cap succeed (CI saw 300/400). Being a *fatal* assert, the failure also skipped `gate->Stop()`/`server.Stop()`, leaving in-flight calls and crashing the next test (SIGSEGV). Make it deterministic: SleepyEchoService now *holds* its concurrency slot until released (bumping an `admitted` counter on entry). Each test fires exactly the cap (100) with a long deadline, waits until the cap is full, then floods the rest with a short deadline -- all dropped, since the cap is held. Once those drain, it releases the held calls: precisely the cap's worth succeed, no extra slips into a freed slot, and nothing times out -- independent of runner speed. Also switch to EXPECT_EQ so a failure can't skip teardown. Verified: passes 10/10 under heavy CPU load (was consistently failing on the busy CI runner); ~1.1s each (down from ~3.1s).
46d24e4 to
fd8ec71
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Server.OverloadTestNoCreateSpecialMessage(andOverloadTest) flake on busy CI runners — the Bazel lane failed consistently withserver_test.cc:261: ASSERT_EQ(100, succeeded)→Which is: 300/400, followed by a SIGSEGV in the next test.The handler slept a fixed
2s, and the test fired N (1000/10000) calls at a 100-slot cap expecting exactly 100 to succeed. On a slow runner, submitting all N takes longer than 2s, so the first sleepy handlers finish and free slots mid-submit → more than the cap succeed. And because it's a fatalASSERT_, the failure skippedgate->Stop()/server.Stop(), leaving in-flight calls → the SIGSEGV in the following test.Fix — make it deterministic
SleepyEchoServicenow holds its concurrency slot until released (bumping anadmittedcounter on entry). Each test:admitted == 100),So precisely the cap's worth succeed, no extra slips into a freed slot, and nothing times out — independent of runner speed. Also switched
ASSERT_EQ→EXPECT_EQso a failure can't skip teardown.Verification
server_testcases pass; the downstream SIGSEGV is gone