Question
I built a chatbot server using vLLM's AsyncLLMEngine and an MCP Streamable HTTP server via its Python SDK.
To stream token generation from the AsyncLLMEngine generator to the MCP client, I call context.report_progress as shown below.
...
foundation_model = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        ...
    )
)
...
@server.tool()
async def chat(q: str, ctx: Context):
    messages = get_messages(q)
    generator = foundation_model.generate(messages, ...)
    answer = ""  # cumulative text already reported
    i = 0        # progress counter
    async for request_output in generator:
        current_text = request_output.outputs[0].text
        # vLLM yields the full text so far; send only the new suffix
        next_text = current_text[len(answer):]
        if ctx is not None and isinstance(ctx, Context):
            await ctx.report_progress(i, None, next_text)
        answer = current_text
        i += 1
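For clarity, the delta computation inside the loop can be shown in isolation: each step vLLM yields the full cumulative text so far, and the tool forwards only the newly generated suffix. A minimal, self-contained sketch of that slicing logic (the helper name `deltas` is illustrative, not part of either SDK):

```python
def deltas(snapshots):
    """Given cumulative text snapshots (as RequestOutput.outputs[0].text
    grows across iterations), yield only the newly generated suffix."""
    sent = ""
    for text in snapshots:
        yield text[len(sent):]  # the part not yet reported
        sent = text

# Example: cumulative snapshots -> per-step deltas
print(list(deltas(["He", "Hello", "Hello world"])))
# → ['He', 'llo', ' world']
```

Each delta is what gets passed as the message argument of ctx.report_progress, so the client can reassemble the stream by simple concatenation.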
Is this usage appropriate for streaming the token generation process?
Is there another way?
Additional Context
No response