TOT-SQL Safeguard submission #56
Hi @NG-VikasV, thank you for your contribution! Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.
Hi @Ruiying-Ma, I have attached all traces, including the agnews ones. Please let us know if you need anything more. Thank you.
Hi @NG-VikasV, thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, so we require at least 5 runs for each query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.
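For reference, the scoring rule described above (a per-query average over at least 5 runs, then an average across queries) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `pass_at_1`, the `results` mapping of query id to per-run correctness flags, and the choice to average over every supplied run are hypothetical and are not taken from the benchmark's verification code.

```python
def pass_at_1(results, runs_required=5):
    """Average the per-query pass rate across queries.

    `results` maps a query id to a list of booleans, one per run
    (True = the run's answer was judged correct). This layout is an
    assumption made for illustration.
    """
    if not results:
        raise ValueError("no queries to score")
    per_query_rates = []
    for query_id, outcomes in results.items():
        # Reject queries that do not meet the minimum run count.
        if len(outcomes) < runs_required:
            raise ValueError(
                f"{query_id}: {len(outcomes)} run(s), need at least {runs_required}"
            )
        passed = sum(1 for ok in outcomes if ok)
        per_query_rates.append(passed / len(outcomes))
    return sum(per_query_rates) / len(per_query_rates)
```

Under this sketch, a submission with a single run per query is rejected outright instead of being scored as if one answer had been produced five times.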
Hi @Ruiying-Ma, thank you for the detailed feedback, and apologies for the issues with the initial submission. Both points have been addressed.

The updated files are committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip: all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/

Direct link: Click here

Best regards, Vikas
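A quick way to confirm run coverage for a bundle laid out as traces/{dataset}/query{id}/run{0–4}/ might look like the sketch below. It assumes the zip has already been extracted and that every run lives in its own directory; the function name `find_incomplete_queries` is hypothetical, and the files inside each run directory are not inspected.

```python
from pathlib import Path


def find_incomplete_queries(traces_root, runs_required=5):
    """Return {"dataset/queryN": [missing run indices]} for incomplete queries.

    Assumes the layout <traces_root>/<dataset>/query<id>/run<k>/ with
    k starting at 0.
    """
    missing = {}
    for query_dir in sorted(Path(traces_root).glob("*/query*")):
        if not query_dir.is_dir():
            continue
        # Collect every expected run index that has no directory.
        absent = [
            k for k in range(runs_required)
            if not (query_dir / f"run{k}").is_dir()
        ]
        if absent:
            missing[f"{query_dir.parent.name}/{query_dir.name}"] = absent
    return missing
```

An empty result means every query directory found has run0 through run4. It does not check that all 54 queries are present, so the query count would still need a separate check.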
Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No
A reasoning-based agent architecture for multi-step Data-to-SQL tasks. It uses LLM reasoning to construct queries dynamically, run execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases, without using any hardcoded schemas or domain-specific hints.
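As a rough illustration of what an execution-verification loop can look like, the sketch below runs a proposed query against SQLite and feeds any database error back to the proposer for another attempt. It is not the agent's actual implementation: `run_with_verification` and the `propose_sql` callback (a stand-in for the LLM call) are hypothetical names, and only the standard-library `sqlite3` module is used.

```python
import sqlite3


def run_with_verification(conn, propose_sql, max_attempts=3):
    """Execute proposed SQL, retrying with error feedback on failure.

    `propose_sql(feedback)` is a hypothetical stand-in for the LLM:
    it receives None on the first attempt, or a description of the
    previous failure, and returns a SQL string.
    """
    feedback = None
    for _ in range(max_attempts):
        sql = propose_sql(feedback)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Pass the failing query and the error text to the next attempt.
            feedback = f"{sql!r} failed: {exc}"
    raise RuntimeError(
        f"no valid query after {max_attempts} attempts; last error: {feedback}"
    )
```

This sketch only verifies that a query executes. Checking that the returned rows actually answer the question would need an additional step that is not shown here.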