fix: Install Consul without Windows service to avoid dev agent conflict #77
The chocolatey package registers Consul as a Windows service which races with the dev agent for port binding. Stop and disable the service before starting the dev agent, and add a readiness check that polls the leader endpoint to confirm the agent is ready. Co-Authored-By: mkeeler@launchdarkly.com <keelerm84@gmail.com>
Point the Windows persistent-stores action at the commit that stops the Consul Windows service before starting the dev agent, replacing the local workaround. See: launchdarkly/gh-actions#77 Co-Authored-By: mkeeler@launchdarkly.com <keelerm84@gmail.com>
```diff
@@ -82,8 +82,34 @@ runs:
   # 1.20.1 until they fix the issue
   run: |
     choco install consul --version=1.20.1
```
This should allow you to get rid of the next Stop-Service command.
```diff
- choco install consul --version=1.20.1
+ choco install consul --version=1.20.1 --params="'/noservice'"
```
Good call -- done in 3352941. Switched to `choco install consul --version=1.20.1 --params="'/noservice'"` and removed the `Stop-Service`/`Set-Service` block, so the service is never registered rather than registered and then disabled. Kept the `/v1/status/leader` readiness poll.
Use the chocolatey /noservice package parameter so Consul is never registered as a Windows service, instead of registering it and then stopping and disabling it. This removes the service that raced with the dev agent for the port, so the Stop-Service and Set-Service steps are no longer needed.
The dev agent elects itself leader effectively instantly, so the poll is not waiting on slow Raft election. Reword the comment to reflect its actual purpose: Start-Process returns immediately, so we wait until the HTTP API is up and serving before tests run.
```powershell
# Start-Process returns immediately, so wait until the dev agent's HTTP
# API is up and serving (leader elected) before tests run.
$timeout = 30
$elapsed = 0
while ($elapsed -lt $timeout) {
    try {
        $response = Invoke-RestMethod -Uri 'http://127.0.0.1:${{ inputs.consul_port }}/v1/status/leader' -TimeoutSec 2
        if ($response -and $response -ne '""') {
            Write-Host "Consul is ready (leader: $response)"
            break
        }
    } catch {}
    Write-Host "Waiting for Consul to elect a leader... ($elapsed/$timeout seconds)"
    Start-Sleep -Seconds 1
    $elapsed++
}
if ($elapsed -eq $timeout) {
    Write-Error "Consul did not become ready within $timeout seconds"
    exit 1
}
```
The readiness check can be simplified. In dev mode Consul immediately self-elects as leader, so the real intent here is just "is the agent's HTTP API up", and no leader polling is needed. `consul members` (already on PATH after install) exits non-zero until the agent responds, which makes it a self-describing check that needs no HTTP request construction or JSON parsing.
```diff
- # Start-Process returns immediately, so wait until the dev agent's HTTP
- # API is up and serving (leader elected) before tests run.
- $timeout = 30
- $elapsed = 0
- while ($elapsed -lt $timeout) {
-     try {
-         $response = Invoke-RestMethod -Uri 'http://127.0.0.1:${{ inputs.consul_port }}/v1/status/leader' -TimeoutSec 2
-         if ($response -and $response -ne '""') {
-             Write-Host "Consul is ready (leader: $response)"
-             break
-         }
-     } catch {}
-     Write-Host "Waiting for Consul to elect a leader... ($elapsed/$timeout seconds)"
-     Start-Sleep -Seconds 1
-     $elapsed++
- }
- if ($elapsed -eq $timeout) {
-     Write-Error "Consul did not become ready within $timeout seconds"
-     exit 1
- }
+ # consul members exits non-zero until the agent's HTTP API responds.
+ # In dev mode the leader is self-elected immediately on startup, so
+ # this is purely a process-readiness check.
+ $deadline = (Get-Date).AddSeconds(30)
+ $ready = $false
+ while ((Get-Date) -lt $deadline) {
+     consul members -http-addr="http://127.0.0.1:${{ inputs.consul_port }}" 2>&1 | Out-Null
+     if ($LASTEXITCODE -eq 0) { $ready = $true; break }
+     Start-Sleep -Seconds 1
+ }
+ if (-not $ready) {
+     Write-Error "Consul did not become ready within 30 seconds"
+     exit 1
+ }
```
Agreed, that's cleaner -- applied in dcd7d2f. Swapped the `/v1/status/leader` HTTP poll for the `consul members` exit-code check, so there is no HTTP/JSON handling and it reads as a plain process-readiness gate.
jsonbailey left a comment
I left a comment suggesting an alternative check that doesn't need the try/catch. I won't block on it, since I haven't tested it.
Replace the /v1/status/leader HTTP poll with consul members, which exits non-zero until the agent's HTTP API responds. This drops the HTTP request construction and JSON parsing in favor of a self-describing check using a binary already on PATH after install.
The consul members readiness check from #77 fails the step on its first iteration. While Consul is still starting, consul members writes a connection-refused message to stderr; the 2>&1 merge under the Actions PowerShell wrapper ($ErrorActionPreference=Stop) surfaces that as a NativeCommandError and terminates the step before the retry loop can run. Restore the previous /v1/status/leader HTTP poll, where Invoke-RestMethod raises a catchable exception that the try/catch swallows so the loop retries until the agent is ready. The /noservice install change from #77 is unaffected.
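The retry behavior described in that commit message can be sketched as a small helper: a failing probe raises an exception, the `try`/`catch` swallows it, and the loop retries until a deadline. This is a minimal sketch, not the action's actual code. The function name `Wait-Until` and its parameters are hypothetical, and port 8500 in the commented usage is an assumption (the action reads the port from `inputs.consul_port`).

```powershell
# Hypothetical helper (not part of the action): re-run a probe until it
# returns a truthy value or the deadline passes.
function Wait-Until {
    param(
        [Parameter(Mandatory)] [scriptblock] $Probe,
        [int] $TimeoutSeconds = 30,
        [int] $IntervalSeconds = 1
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        try {
            # A truthy result means the probe succeeded.
            if (& $Probe) { return $true }
        } catch {
            # A failing probe (for example, connection refused while the
            # agent is still starting) is expected; fall through and retry.
        }
        Start-Sleep -Seconds $IntervalSeconds
    }
    return $false
}

# Usage sketch (assumes the agent listens on port 8500):
# $ready = Wait-Until -Probe {
#     [bool](Invoke-RestMethod -Uri 'http://127.0.0.1:8500/v1/status/leader' -TimeoutSec 2)
# }
# if (-not $ready) { Write-Error 'Consul did not become ready'; exit 1 }
```

Because the probe runs inside `try`/`catch`, an exception from `Invoke-RestMethod` is caught and retried instead of ending the step, which is the property the restored HTTP poll relies on.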
Summary

The `choco install consul` command registers Consul as a Windows service, which races with the subsequent `Start-Process consul agent -dev` for port binding. When the service wins the race and binds to the configured port first, the dev agent silently fails to start, causing all Consul integration tests in downstream repos to fail with `500 No known Consul servers`.

This was observed as a flaky failure in python-server-sdk#411. Only `windows (3.14)` lost the race on that run, but the issue is non-deterministic and affects all Windows matrix entries.

This PR:
- Uses the chocolatey `/noservice` package parameter so the Windows service is never registered, eliminating the process that raced with the dev agent for the port.
- Adds a readiness check that polls `/v1/status/leader` (up to 30s) so the step only completes once Consul has elected a leader and is actually ready to serve requests.

Review & Testing Checklist for Human
- Confirm the `/noservice` parameter is honored by the consul 1.20.1 package so no Windows service is registered (and therefore nothing competes for the port).
- […] `runner.os == 'Windows'` conditional, but worth a glance.

Notes
- `-ErrorAction SilentlyContinue` on `Stop-Service`/`Set-Service` is intentional: if a future chocolatey package version stops registering a Windows service, these calls will no-op gracefully instead of failing.

Link to Devin session: https://app.devin.ai/sessions/218fb7de91864819adbc8746029f7331
Requested by: @keelerm84
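
One way to spot-check the `/noservice` checklist item on a Windows runner is to ask whether a Consul service is registered and which process, if any, is listening on the Consul port. This is a hedged sketch, not part of the PR. The service name `consul` and port 8500 (Consul's default HTTP port) are assumptions; the action takes the port from `inputs.consul_port`.

```powershell
# Assumptions: the chocolatey package would register the service as 'consul',
# and the agent listens on Consul's default HTTP port, 8500.
$serviceName = 'consul'
$port = 8500

# Report whether a Windows service with that name exists.
$service = Get-Service -Name $serviceName -ErrorAction SilentlyContinue
if ($service) {
    Write-Host "Service '$serviceName' is registered (status: $($service.Status))"
} else {
    Write-Host "No '$serviceName' service is registered"
}

# Report which process, if any, is listening on the port.
$listeners = Get-NetTCPConnection -LocalPort $port -State Listen -ErrorAction SilentlyContinue
foreach ($listener in $listeners) {
    $owner = Get-Process -Id $listener.OwningProcess -ErrorAction SilentlyContinue
    Write-Host "Port $port is held by PID $($listener.OwningProcess) ($($owner.ProcessName))"
}
if (-not $listeners) {
    Write-Host "Nothing is listening on port $port"
}
```

Run after the install step and before `Start-Process`, this should report no registered service and no listener if the package parameter took effect.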