Fix decentralized clients connection issue #1110

Open: ahzero7d1 wants to merge 9 commits into develop from decentralized-connection-fix

ahzero7d1 (Collaborator) commented Apr 15, 2026

Solves issue: #1109

This pull request addresses peer connection stability and synchronization problems in decentralized learning. It covers three main issues.

Connection Ready Check and Signaling Weight Sharing

ConnectionsReady and StartWeightSharing messages are added to prevent faster peers from proceeding to weight sharing while slower peers are still establishing connections. The process is as follows:

  1. After completing peer connections, each client sends a ConnectionsReady message to the server.
  2. The server counts the number of received ConnectionsReady messages.
  3. Once the number of ConnectionsReady messages matches the number of expected participants for the round, the server sends a StartWeightSharing message to all round participants.
  4. Clients only start sharing model updates after receiving StartWeightSharing.
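
The readiness check above behaves like a barrier on the server. A minimal sketch, assuming a hypothetical `ReadinessBarrier` class, a `send` callback, and message payload shapes that are not taken from this PR (only the message names are):

```typescript
// Illustrative only: the class, the `send` callback and the payload shape are assumptions.
type ClientId = string;
type StartWeightSharing = { type: "StartWeightSharing"; round: number };

class ReadinessBarrier {
  private readonly round: number;
  private readonly expected: ReadonlySet<ClientId>;
  private readonly send: (to: ClientId, msg: StartWeightSharing) => void;
  private readonly ready = new Set<ClientId>();
  private released = false;

  constructor(
    round: number,
    expected: ReadonlySet<ClientId>,
    send: (to: ClientId, msg: StartWeightSharing) => void,
  ) {
    this.round = round;
    this.expected = expected;
    this.send = send;
  }

  // Call when the server receives ConnectionsReady from `from`.
  // Returns true only on the call that releases the barrier.
  onConnectionsReady(from: ClientId): boolean {
    if (this.released || !this.expected.has(from)) return false;
    this.ready.add(from); // a Set ignores repeated reports from the same peer
    if (this.ready.size < this.expected.size) return false;
    this.released = true;
    for (const id of this.expected) {
      this.send(id, { type: "StartWeightSharing", round: this.round });
    }
    return true;
  }
}
```

Tracking client IDs in a set rather than a bare counter is a design choice of this sketch: a peer that reports twice is not counted twice, so the barrier cannot release early.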

Connection Retries and Failed Client Disconnection

RetryPeerConnection and ConnectionFail messages are added. The webapp also shows an error message when a client is disconnected after repeated failures. A maxConnectionRetry parameter is added to trainingInformation and exposed in the webapp so that users can set it during task creation.
Connection failures are handled as follows:

  1. When connection setup fails and the number of retries is still below maxConnectionRetry, the server sends a RetryPeerConnection message to all peers in the current round.
  2. When clients receive RetryPeerConnection, they clean up their peer pool and aggregator nodes, then rerun the peer connection phase.
  3. If the connection setup still fails after more than maxConnectionRetry attempts, the server removes the failed peers from the round peer list.
  4. The server sends a ConnectionFail message to the failed peers.
  5. When a client receives ConnectionFail, it disconnects from the server.
  6. The remaining clients receive RetryPeerConnection and retry the peer connection phase without the failed peers.
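
The server-side decision in the steps above can be sketched as a pure function. The function name, the `RetryOutcome` shape, and the choice to reset the retry counter after failed peers are removed are assumptions made for illustration. The PR description does not specify them.

```typescript
// Illustrative only: names and return shape are hypothetical.
type ClientId = string;
type RetryMessage = { type: "RetryPeerConnection" } | { type: "ConnectionFail" };

interface RetryOutcome {
  retries: number; // updated retry counter for the round
  peers: ClientId[]; // peers that remain in the round
  outbox: Array<[ClientId, RetryMessage]>; // messages the server should send
}

function handleConnectionFailure(
  peers: readonly ClientId[],
  failed: ReadonlySet<ClientId>,
  retries: number,
  maxConnectionRetry: number,
): RetryOutcome {
  // Still within budget: everyone in the round retries together.
  if (retries < maxConnectionRetry) {
    const outbox = peers.map(
      (id): [ClientId, RetryMessage] => [id, { type: "RetryPeerConnection" }],
    );
    return { retries: retries + 1, peers: [...peers], outbox };
  }

  // Budget exhausted: drop the failed peers and notify them,
  // then ask the remaining peers to reconnect without them.
  const remaining = peers.filter((id) => !failed.has(id));
  const outbox: Array<[ClientId, RetryMessage]> = [];
  for (const id of peers) {
    if (failed.has(id)) outbox.push([id, { type: "ConnectionFail" }]);
  }
  for (const id of remaining) {
    outbox.push([id, { type: "RetryPeerConnection" }]);
  }
  // Assumption: the counter restarts for the reduced peer list.
  return { retries: 0, peers: remaining, outbox };
}
```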

Model Syncing for Participants Joining in the Middle of Training

Model syncing between peers is implemented for participants who join in the middle of training. ModelSyncRequest, SignalModelProvider, SignalNewPeer, and SharedModel messages are added. In addition, a ModelWeightAccess interface is added in disco.ts so that the client object can update the model weights when it receives the latest model from the provider peer.
The model syncing process is as follows:

  1. When a new participant joins in the middle of training, the server marks the participant as having joined mid-training in NewDecentralizedNodeInfo.
  2. After receiving NewDecentralizedNodeInfo, the new client sets a local flag indicating that it needs model syncing.
  3. When training begins, if this flag is set, the new client sends a ModelSyncRequest message to the server.
  4. After receiving ModelSyncRequest, the server sends the messages described in steps 5 and 6, using the model provider selected in the previous training round.
  5. The server sends SignalModelProvider to the new participant with information about the provider peer.
  6. The server sends SignalNewPeer to the provider peer with information about the newly joined peer.
  7. Using this signaling information, the new participant and provider peer establish a peer connection.
  8. The provider waits until the current aggregation round has finished, then sends the latest model to the new participant using a SharedModel message.
  9. The new participant receives the model and updates its local model weights.
  10. After syncing, the new participant can join subsequent decentralized training rounds.
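
Steps 4 to 8 can be sketched in two small pieces: the server routing a ModelSyncRequest, and the provider deferring the SharedModel message until the current aggregation round has finished. The function names, the payload fields, and the use of `Float32Array[]` for weights are hypothetical and do not reflect the PR's actual types or the ModelWeightAccess interface.

```typescript
// Illustrative only: payload fields and function names are assumptions.
type ClientId = string;
type SyncSignal =
  | { type: "SignalModelProvider"; to: ClientId; provider: ClientId }
  | { type: "SignalNewPeer"; to: ClientId; newPeer: ClientId };
type SharedModel = { type: "SharedModel"; weights: Float32Array[] };

// Server side (steps 4 to 6): pair the new participant with the provider
// selected in the previous round, if there is one.
function onModelSyncRequest(
  newPeer: ClientId,
  provider: ClientId | undefined,
): SyncSignal[] {
  if (provider === undefined || provider === newPeer) return [];
  return [
    { type: "SignalModelProvider", to: newPeer, provider },
    { type: "SignalNewPeer", to: provider, newPeer },
  ];
}

// Provider side (step 8): wait for the current aggregation round, then send.
// Weights are read only after the wait, so the new peer gets the latest model.
async function shareModelAfterRound(
  roundFinished: Promise<void>,
  getWeights: () => Float32Array[],
  send: (msg: SharedModel) => void,
): Promise<void> {
  await roundFinished;
  send({ type: "SharedModel", weights: getWeights() });
}
```

Reading the weights after `roundFinished` resolves, rather than when the request arrives, is what keeps the new participant from starting on a model that is one aggregation behind.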
