Fix decentralized clients connection issue by ahzero7d1 · Pull Request #1110 · epfml/disco

ahzero7d1 · 2026-04-15T14:01:39Z

Solves issue: #1109

This pull request addresses peer connection stability and synchronization problems in decentralized learning. It covers three main issues.

Connection Ready Check and Signaling Weight Sharing

ConnectionsReady and StartWeightSharing messages are added to prevent faster peers from proceeding to weight sharing while slower peers are still establishing connections. The process is as follows:

1. After completing peer connections, each client sends a ConnectionsReady message to the server.
2. The server counts the number of received ConnectionsReady messages.
3. Once the number of ConnectionsReady messages matches the number of expected participants for the round, the server sends a StartWeightSharing message to all round participants.
4. Clients only start sharing model updates after receiving StartWeightSharing.
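The readiness check above can be sketched as a small server-side barrier. This is only an illustration: the `ReadyBarrier` class, the `send` callback, and the message shape are hypothetical names, not the types used in this PR.

```typescript
// Hypothetical types for illustration; the PR's actual message types may differ.
type ClientId = string;
type StartWeightSharing = { type: "StartWeightSharing"; round: number };
type Send = (to: ClientId, msg: StartWeightSharing) => void;

class ReadyBarrier {
  private readonly ready = new Set<ClientId>();
  private released = false;
  private readonly round: number;
  private readonly expected: ReadonlySet<ClientId>;
  private readonly send: Send;

  constructor(round: number, expected: ReadonlySet<ClientId>, send: Send) {
    this.round = round;
    this.expected = expected;
    this.send = send;
  }

  // Call when a ConnectionsReady message arrives from `from`.
  // Returns true once StartWeightSharing has been sent to the round.
  onConnectionsReady(from: ClientId): boolean {
    // Ignore messages after release and messages from non-participants.
    if (this.released || !this.expected.has(from)) return this.released;
    this.ready.add(from); // a Set makes duplicate messages harmless
    if (this.ready.size < this.expected.size) return false;
    this.released = true;
    for (const id of this.expected) {
      this.send(id, { type: "StartWeightSharing", round: this.round });
    }
    return true;
  }
}
```

Tracking ready clients in a set rather than a plain counter is a design choice of this sketch: a client that resends ConnectionsReady cannot release the round early.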

Connection Retries and Failed Client Disconnection

RetryPeerConnection and ConnectionFail messages are added. In addition, the webapp shows an error message when a client is disconnected after repeated failures. A maxConnectionRetry parameter is added to trainingInformation and exposed in the webapp so that users can set its value during task creation.
The connection failure handling process is as follows:

1. If the number of retries is still below maxConnectionRetry, the server sends a RetryPeerConnection message to all peers in the current round.
2. When clients receive RetryPeerConnection, they clean up their peer pool and aggregator nodes, then rerun the peer connection phase.
3. If the connection setup still fails after more than maxConnectionRetry attempts, the server removes the failed peers from the round peer list.
4. The server sends a ConnectionFail message to the failed peers.
5. When a client receives ConnectionFail, it disconnects from the server.
6. The remaining clients receive RetryPeerConnection and retry the peer connection phase without the failed peers.
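The server-side decision described above could look roughly like the pure function below. It is a sketch under assumptions: `handleConnectionFailure`, `RetryOutcome`, and the way failed peers are passed in are hypothetical, and resetting the counter after dropping peers is a guess rather than something the PR states.

```typescript
type PeerId = string;

// Hypothetical result type: who stays in the round and who gets which message.
type RetryOutcome = {
  retryCount: number;
  roundPeers: PeerId[];  // peers still part of the round
  notifyRetry: PeerId[]; // peers that receive RetryPeerConnection
  notifyFail: PeerId[];  // peers that receive ConnectionFail
};

function handleConnectionFailure(
  roundPeers: readonly PeerId[],
  failedPeers: ReadonlySet<PeerId>,
  retryCount: number,
  maxConnectionRetry: number,
): RetryOutcome {
  if (retryCount < maxConnectionRetry) {
    // Retry budget left: every peer in the round reruns the connection phase.
    return {
      retryCount: retryCount + 1,
      roundPeers: [...roundPeers],
      notifyRetry: [...roundPeers],
      notifyFail: [],
    };
  }
  // Budget exhausted: drop the failed peers and retry with the rest.
  const remaining = roundPeers.filter((p) => !failedPeers.has(p));
  const dropped = roundPeers.filter((p) => failedPeers.has(p));
  return {
    retryCount: 0, // assumption: the counter restarts for the reduced peer set
    roundPeers: remaining,
    notifyRetry: remaining,
    notifyFail: dropped,
  };
}
```

Keeping the decision free of I/O makes it easy to test separately from the message transport.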

Model Syncing for Participants Joining in the Middle of Training

Model syncing between peers is implemented for participants who join in the middle of training. ModelSyncRequest, SignalModelProvider, SignalNewPeer, and SharedModel messages are added. In addition, a ModelWeightAccess interface is added in disco.ts so that the client object can update the model weights when it receives the latest model from the provider peer.
The model syncing process is as follows:

1. When a new participant joins in the middle of training, the server marks the participant as having joined mid-training in NewDecentralizedNodeInfo.
2. After receiving NewDecentralizedNodeInfo, the new client sets a local flag indicating that it needs model syncing.
3. When training begins, if this flag is set, the new client sends a ModelSyncRequest message to the server.
4. After receiving ModelSyncRequest, the server sends the messages described in steps 5 and 6, using the model provider selected in the previous training round.
5. The server sends SignalModelProvider to the new participant with information about the provider peer.
6. The server sends SignalNewPeer to the provider peer with information about the newly joined peer.
7. Using this signaling information, the new participant and provider peer establish a peer connection.
8. The provider waits until the current aggregation round has finished, then sends the latest model to the new participant using a SharedModel message.
9. The new participant receives the model and updates its local model weights.
10. After syncing, the new participant can join subsequent decentralized training rounds.
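A minimal sketch of the signaling and client-side parts of this flow follows. Everything in it is an assumption made for illustration: the shape of `ModelWeightAccess`, the use of plain number arrays for weights, and the `onModelSyncRequest` and `JoiningClient` names are hypothetical and do not reproduce the definitions in disco.ts.

```typescript
type PeerId = string;
type Weights = number[][]; // stand-in for real model tensors

// Assumed shape; the actual ModelWeightAccess interface may differ.
interface ModelWeightAccess {
  getWeights(): Weights;
  setWeights(weights: Weights): void;
}

type SyncSignal =
  | { to: PeerId; type: "SignalModelProvider"; provider: PeerId }
  | { to: PeerId; type: "SignalNewPeer"; newPeer: PeerId };

// Server side: turn a ModelSyncRequest into the two signaling messages.
function onModelSyncRequest(newPeer: PeerId, provider: PeerId | undefined): SyncSignal[] {
  if (provider === undefined || provider === newPeer) return []; // nobody to sync from
  return [
    { to: newPeer, type: "SignalModelProvider", provider },
    { to: provider, type: "SignalNewPeer", newPeer },
  ];
}

// Client side: track the sync flag and apply the received model.
class JoiningClient {
  needsModelSync = false;
  private readonly model: ModelWeightAccess;

  constructor(model: ModelWeightAccess) {
    this.model = model;
  }

  onNewDecentralizedNodeInfo(joinedMidTraining: boolean): void {
    this.needsModelSync = joinedMidTraining;
  }

  // Called with the payload of a SharedModel message.
  onSharedModel(weights: Weights): void {
    if (!this.needsModelSync) return; // ignore unsolicited models
    this.model.setWeights(weights);
    this.needsModelSync = false; // ready for later training rounds
  }
}
```

Routing weight updates through a narrow interface keeps the client from depending on the full training object, which appears to be the motivation for ModelWeightAccess.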

ahzero7d1 and others added 9 commits April 15, 2026 14:52

Add ConnectionsReady, StartWeightSharing messages

2900c65

Update gitignore

dd22a7d

Fix lint error

7d0e634

Add server timeout and retry handling for peer connection failures

947d2c6

Merge branch 'develop' into decentralized-connection-fix

18e777e

Implement decentralized model synchronization for clients joining during training

b443f53

Improve model syncing and add max retry parameter

9c7c488

Fix promise returning problem during SignalNewPeer message processing

09ced04

Specify event type

14b2a4b

ahzero7d1 wants to merge 9 commits into develop from decentralized-connection-fix
