Job_chat: Model outputs an empty or partial response #497

@hanna-paasivirta

This test scenario, which asks the model to explain code, fails in at least 40% of runs via both job_chat and global_chat on main. The model either emits an empty answer, emits three dots ('...'), or begins to explain the code but ends the answer with a colon, with no code inline or in the suggested_code key.
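To make the three failure shapes concrete, here is a minimal sketch of how they could be told apart. It assumes the result is a dict with the `response` and `suggested_code` keys seen in the output further down; `classify_response` and its labels are hypothetical names for illustration, not something that exists in apollo.

```python
FENCE = "`" * 3  # marker for an inline fenced code block


def classify_response(result: dict) -> str:
    """Label a chat result as 'empty', 'dots', 'dangling_colon', or 'ok'.

    Hypothetical helper: the key names are taken from the logged output,
    everything else is an assumption made for illustration.
    """
    text = (result.get("response") or "").strip()
    code = (result.get("suggested_code") or "").strip()

    if not text:
        return "empty"
    # A bare ellipsis, either three ASCII dots or the single Unicode character.
    if text in {"...", "\u2026"}:
        return "dots"
    # Text that announces code (ends with a colon) but delivers none,
    # neither inline nor in suggested_code.
    if text.endswith(":") and not code and FENCE not in text:
        return "dangling_colon"
    return "ok"
```

For the run logged in this issue, `classify_response({'response': '...', 'suggested_code': None})` would come back as `'dots'`.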

I dug into this extensively but don't understand it yet. There are some known issues with combining adaptive thinking and structured outputs, but we are probably triggering something inadvertently. It seems like there are too many constraints on the output and the model decides to end the turn.

I suspected that putting the code before the text wasn't the most natural ordering. I tried reversing it, and the problem still appears (at least the variant that cuts off on a colon: "...Here's an example that adds multiple members to your list in a single batch:", 'suggested_code': ""), but it might happen less frequently (1 in 10 tests).
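Since the comparison between the two orderings rests on failure rates, here is a rough sketch of how the rate could be counted over repeated runs. `failure_rate`, `run_once` and `is_failure` are hypothetical names, and the canned results stand in for real service calls, which this sketch does not make.

```python
from itertools import cycle
from typing import Callable


def failure_rate(
    run_once: Callable[[], dict],
    is_failure: Callable[[dict], bool],
    n: int = 10,
) -> float:
    """Call run_once n times and return the fraction of failing results."""
    if n <= 0:
        raise ValueError("n must be positive")
    failures = sum(1 for _ in range(n) if is_failure(run_once()))
    return failures / n


# Stand-in for the service: alternates a failing and a passing result.
canned = cycle(
    [
        {"response": "...", "suggested_code": None},
        {"response": "Here is how it works.", "suggested_code": "each(items, fn)"},
    ]
)

rate = failure_rate(
    run_once=lambda: next(canned),
    is_failure=lambda r: (r.get("response") or "").strip() in {"", "..."},
    n=10,
)
print(f"failure rate: {rate:.0%}")
```

With only 10 runs per variant a single failure moves the rate by 10 percentage points, so a larger n would be needed before reading much into "40%" versus "1 in 10".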

I added the test in services/global_chat/tests/acceptance/bugs/test_repro_dots_response.md and services/job_chat/tests/acceptance/bugs/test_repro_dots_response.md in global-chat-job-code in #495

hanna@Hannas-MacBook-Pro-2 apollo % poetry run pytest services/job_chat/tests/acceptance/tmp/test_repro_dots_response.md -s
=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.11.15, pytest-8.4.1, pluggy-1.6.0
codspeed: 3.2.0 (disabled, mode: walltime, timer_resolution: 41.7ns)
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/hanna/openfn/apollo
configfile: pyproject.toml
plugins: recording-0.13.4, anyio-4.9.0, syrupy-4.9.1, socket-0.7.0, langsmith-0.4.1, codspeed-3.2.0, benchmark-5.1.0, asyncio-0.26.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item

services/job_chat/tests/acceptance/tmp/test_repro_dots_response.md
→ job-chat.tmp.repro-dots-response
  service: job_chat
  judges:  general, openfn_code_quality
  calling job_chat...
Calling services/job_chat ...

INFO:search_adaptor_docs:Checking/loading adaptor docs for @openfn/language-mailchimp@latest
INFO:load_adaptor_docs:Checking if docs already exist in database
INFO:load_adaptor_docs:✓ Docs already exist for @openfn/language-mailchimp@latest (27 functions)
INFO:search_adaptor_docs:Adaptor docs for @openfn/language-mailchimp@latest already exist (checked in 0.008s)
INFO:job_chat:Making non-streaming API call
INFO:job_chat:model output: stop_reason='end_turn' content_block_types=['thinking', 'text'] content_block_lengths=[479, 40]
INFO:job_chat:Cache write: 533 tokens
INFO:job_chat:Cache read: 2930 tokens
INFO:job_chat:parse_and_apply_edits: text_answer_len=3 code_edits_count=0 actions=[] has_original=True

Done!
{'response': '...', 'suggested_code': None, 'history': [{'role': 'assistant', 'content': "\nHi! I'm here to help you create an OpenFn workflow. Could you describe the process you'd like to automate? For example, tell me what systems are involved, what data needs to move between them, and when or how the workflow should be triggered."}, {'role': 'user', 'content': '\n[pg:workflows/daily-google-sheets-email-sender] i want this to send an email to my group 1234  whre the title and body are just green emojis'}, {'role': 'assistant', 'content': '\nThis job creates a Mailchimp campaign for list 1234 with a green emoji subject and body, then sends it. Make sure to update from_name and reply_to with your actual sender details — these must match a verified sender in your Mailchimp account.'}, {'role': 'user', 'content': '\n[pg:workflows/daily-google-sheets-email-sender/update-mailchimp] what does the ... do'}, {'role': 'assistant', 'content': "\nThe ... is JavaScript's spread operator. In { ...state, campaignId: state.data.id }, it copies all existing properties from state into the new object, then adds/overwrites campaignId. Without it, you'd lose everything else in state (like data, configuration, etc.) 
and only have campaignId."}, {'role': 'user', 'content': '[pg:job_code/update-mailchimp/mailchimp@latest] \nokay can you show me how to use the hardest function for htis adaptor to teach me'}, {'role': 'assistant', 'content': '...'}], 'usage': {'cache_creation_input_tokens': 533, 'cache_read_input_tokens': 2930, 'input_tokens': 733, 'output_tokens': 167}, 'meta': {'rag': {'search_results': [], 'search_results_sections': [], 'search_queries': [], 'config_version': 1.0, 'prompts_version': 1.0, 'usage': {'needs_docs': {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'inference_geo': 'global', 'input_tokens': 730, 'output_tokens': 4, 'server_tool_use': None, 'service_tier': 'standard'}, 'generate_queries': {}}}}}

Status
In progress