This test scenario, which asks the model to explain code, fails in at least 40% of cases via both job_chat and global_chat on main. The model either emits an empty answer, emits three dots ('...'), or begins to explain the code but ends the answer with a colon, with no code inline or in the suggested_code key.
I dug into this extensively but don't understand it. There are some known issues with combining adaptive thinking and structured outputs, but we are probably triggering something inadvertently. It seems like there are too many constraints on the output, so the model decides to stop the turn.
I suspected that ordering the output with code before text wasn't the most natural arrangement. I tried reversing it, and the problem still appears (at least the variant that cuts off on a colon: "...Here's an example that adds multiple members to your list in a single batch:", 'suggested_code': ""), but it may happen less frequently (1 in 10 tests).
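To make the three failure modes above easier to count across runs, here is a minimal sketch of a classifier. The function name classify_response and its labels are hypothetical and not part of apollo; it only assumes the result exposes a response string and a suggested_code value, as in the output below.

```python
from typing import Optional


def classify_response(response: Optional[str], suggested_code: Optional[str]) -> str:
    """Label a chat result as 'empty', 'dots', 'dangling_colon', or 'ok'.

    Hypothetical helper for triage; not part of the apollo codebase.
    """
    text = (response or "").strip()
    if not text:
        return "empty"
    if text in {"...", "…"}:
        return "dots"
    # The answer promises an example (ends with a colon) but no code follows,
    # neither in a fenced block inside the text nor in suggested_code.
    has_code = bool((suggested_code or "").strip()) or "```" in text
    if text.endswith(":") and not has_code:
        return "dangling_colon"
    return "ok"
```

For example, classify_response('...', None) returns 'dots', and an answer ending in "in a single batch:" with an empty suggested_code returns 'dangling_colon'.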
hanna@Hannas-MacBook-Pro-2 apollo % poetry run pytest services/job_chat/tests/acceptance/tmp/test_repro_dots_response.md -s
=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.11.15, pytest-8.4.1, pluggy-1.6.0
codspeed: 3.2.0 (disabled, mode: walltime, timer_resolution: 41.7ns)
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/hanna/openfn/apollo
configfile: pyproject.toml
plugins: recording-0.13.4, anyio-4.9.0, syrupy-4.9.1, socket-0.7.0, langsmith-0.4.1, codspeed-3.2.0, benchmark-5.1.0, asyncio-0.26.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item
services/job_chat/tests/acceptance/tmp/test_repro_dots_response.md
→ job-chat.tmp.repro-dots-response
service: job_chat
judges: general, openfn_code_quality
calling job_chat...
Calling services/job_chat ...
INFO:search_adaptor_docs:Checking/loading adaptor docs for @openfn/language-mailchimp@latest
INFO:load_adaptor_docs:Checking if docs already exist in database
INFO:load_adaptor_docs:✓ Docs already exist for @openfn/language-mailchimp@latest (27 functions)
INFO:search_adaptor_docs:Adaptor docs for @openfn/language-mailchimp@latest already exist (checked in 0.008s)
INFO:job_chat:Making non-streaming API call
INFO:job_chat:model output: stop_reason='end_turn' content_block_types=['thinking', 'text'] content_block_lengths=[479, 40]
INFO:job_chat:Cache write: 533 tokens
INFO:job_chat:Cache read: 2930 tokens
INFO:job_chat:parse_and_apply_edits: text_answer_len=3 code_edits_count=0 actions=[] has_original=True
Done!
{'response': '...', 'suggested_code': None, 'history': [{'role': 'assistant', 'content': "\nHi! I'm here to help you create an OpenFn workflow. Could you describe the process you'd like to automate? For example, tell me what systems are involved, what data needs to move between them, and when or how the workflow should be triggered."}, {'role': 'user', 'content': '\n[pg:workflows/daily-google-sheets-email-sender] i want this to send an email to my group 1234 whre the title and body are just green emojis'}, {'role': 'assistant', 'content': '\nThis job creates a Mailchimp campaign for list 1234 with a green emoji subject and body, then sends it. Make sure to update from_name and reply_to with your actual sender details — these must match a verified sender in your Mailchimp account.'}, {'role': 'user', 'content': '\n[pg:workflows/daily-google-sheets-email-sender/update-mailchimp] what does the ... do'}, {'role': 'assistant', 'content': "\nThe ... is JavaScript's spread operator. In { ...state, campaignId: state.data.id }, it copies all existing properties from state into the new object, then adds/overwrites campaignId. Without it, you'd lose everything else in state (like data, configuration, etc.) 
and only have campaignId."}, {'role': 'user', 'content': '[pg:job_code/update-mailchimp/mailchimp@latest] \nokay can you show me how to use the hardest function for htis adaptor to teach me'}, {'role': 'assistant', 'content': '...'}], 'usage': {'cache_creation_input_tokens': 533, 'cache_read_input_tokens': 2930, 'input_tokens': 733, 'output_tokens': 167}, 'meta': {'rag': {'search_results': [], 'search_results_sections': [], 'search_queries': [], 'config_version': 1.0, 'prompts_version': 1.0, 'usage': {'needs_docs': {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'inference_geo': 'global', 'input_tokens': 730, 'output_tokens': 4, 'server_tool_use': None, 'service_tier': 'standard'}, 'generate_queries': {}}}}}
I added the test in services/global_chat/tests/acceptance/bugs/test_repro_dots_response.md and services/job_chat/tests/acceptance/bugs/test_repro_dots_response.md in global-chat-job-code in #495.
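Because the failure is intermittent, a single run says little. The frequencies quoted above could be estimated with a loop like the sketch below. Both failure_rate and dots_or_empty are hypothetical names, and the stubbed call stands in for a real job_chat or global_chat request, which is not shown here.

```python
from typing import Callable, Dict, Iterator


def failure_rate(
    call: Callable[[], Dict],
    is_failure: Callable[[Dict], bool],
    runs: int = 10,
) -> float:
    """Invoke `call` `runs` times and return the fraction of failing results."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if is_failure(call()))
    return failures / runs


def dots_or_empty(result: Dict) -> bool:
    """Hypothetical predicate: the response is blank or only '...'."""
    text = (result.get("response") or "").strip()
    return text in {"", "..."}


# Stub standing in for a real service call (assumption): alternates between
# a degenerate answer and a normal one.
_canned: Iterator[Dict] = iter(
    [{"response": "..."}, {"response": "Use each() to loop."}] * 5
)
rate = failure_rate(lambda: next(_canned), dots_or_empty, runs=10)
```

With the alternating stub, rate is 0.5. Against the real service, a larger runs value would give a steadier estimate at the cost of more API calls.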