Skip to content

Cancel monitor and server tasks on backend shutdown#669

Open
Fiona-Waters wants to merge 1 commit intoOpenPipe:mainfrom
Fiona-Waters:fix/cancel-server-task-on-shutdown
Open

Cancel monitor and server tasks on backend shutdown#669
Fiona-Waters wants to merge 1 commit intoOpenPipe:mainfrom
Fiona-Waters:fix/cancel-server-task-on-shutdown

Conversation

@Fiona-Waters
Copy link
Copy Markdown

@Fiona-Waters Fiona-Waters commented Apr 30, 2026

Summary

  • Store _monitor_openai_server task references in LocalBackend and cancel them during close()/_close()
  • Store openai_server_task reference in UnslothService and cancel it during aclose()/close()
  • Iterate over list() copies to avoid RuntimeError from dict mutation when done_callbacks fire during await
  • Add unit test verifying close() cancels monitor tasks

Fixes ##668

Test plan

  • Existing test_local_backend_monitor.py test passes
  • New test_close_cancels_monitor_tasks test passes
  • Run LocalBackend training in shared mode inside a container — pod should exit cleanly after training completes

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Two fire-and-forget asyncio tasks were never cancelled during shutdown:

1. `_monitor_openai_server` in `LocalBackend` — polls vLLM health every
   30s. After 3 consecutive ConnectionRefusedError failures (because vLLM
   is already gone), it re-raises, producing "Task exception was never
   retrieved" and a non-zero process exit.

2. `openai_server_task` in `UnslothService` (shared mode) — the uvicorn
   server task, also orphaned on shutdown.

In containerized environments (Kubeflow, KubeRay), the non-zero exit
causes pod restarts even though training completed successfully.

Fix: store both task references and cancel them during close()/aclose().
Iterate over list() copies to avoid RuntimeError from dict mutation when
done_callbacks fire during await.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant