TOT-SQL Safeguard submission by NG-VikasV · Pull Request #56 · ucbepic/DataAgentBench

NG-VikasV · 2026-06-08T08:32:11Z

Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No

A reasoning-based agent architecture for multi-step Data-to-SQL tasks. It uses LLM reasoning to construct SQL queries dynamically, run execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases without using any hardcoded schemas or domain-specific hints.
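To make the "execution-verification loop" idea concrete, here is a minimal sketch against SQLite only. It is not the agent's actual code: `propose_sql` is a hypothetical stand-in for the LLM call, and the "non-empty result" check is an assumed placeholder for whatever verification the agent really performs.

```python
import sqlite3
from typing import Callable, Optional

def run_with_verification(
    conn: sqlite3.Connection,
    propose_sql: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    question: str,
    max_attempts: int = 3,
):
    """Ask for SQL, execute it, and feed failures back until a result passes."""
    feedback = None
    for _ in range(max_attempts):
        sql = propose_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Execution failed: hand the error text back for the next attempt.
            feedback = f"SQL error: {exc}"
            continue
        if rows:  # assumed placeholder check: accept any non-empty result
            return rows
        feedback = "query returned no rows"
    raise RuntimeError(f"no verified result after {max_attempts} attempts")
```

The loop keeps no schema knowledge of its own; everything the proposer learns about the database comes from the feedback string.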

Ruiying-Ma · 2026-06-08T21:34:21Z

Hi @NG-VikasV — thank you for your contribution!

Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

NG-VikasV · 2026-06-09T05:22:17Z

TT_SQL_V2_traces_all_runs.zip

Hi @Ruiying-Ma,

I have attached the traces for all runs, including the agnews queries. Please let us know if you need anything more. Thank you.

Ruiying-Ma · 2026-06-09T19:49:57Z

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?
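For illustration, the averaging described above could look roughly like the sketch below. This is not the benchmark's actual scoring code: the input shape (query ID mapped to a list of per-run pass/fail flags) and the equal weighting of queries are assumptions.

```python
REQUIRED_RUNS = 5  # minimum runs per query, as stated above

def pass_at_1(run_results: dict[str, list[bool]]) -> float:
    """Average the per-run success rate within each query, then across queries."""
    per_query = []
    for query_id, outcomes in run_results.items():
        if len(outcomes) < REQUIRED_RUNS:
            raise ValueError(
                f"{query_id}: {len(outcomes)} run(s), need at least {REQUIRED_RUNS}"
            )
        per_query.append(sum(outcomes) / len(outcomes))
    return sum(per_query) / len(per_query)
```

Under this definition, copying one run's answer into all 5 slots makes each per-query rate either 0 or 1, which is why independent runs are needed for the average to be meaningful.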

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

NG-VikasV · 2026-06-11T04:29:03Z

Hi @Ruiying-Ma,

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed. The updated files are committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/
submissions/submission_spiderdin.json — reconciled submission with actual per-run answers
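Given that layout, run coverage can be checked mechanically once the zip is extracted. The sketch below assumes the directory names are literally `query<id>` and `run0` through `run4` under each dataset folder, as the path pattern above suggests; the function names are made up for this example.

```python
from pathlib import Path

EXPECTED_RUNS = range(5)  # run0 .. run4

def missing_runs(traces_root: Path) -> dict[str, list[int]]:
    """Map 'dataset/queryN' to the run indices that have no directory."""
    missing = {}
    for query_dir in sorted(traces_root.glob("*/query*")):
        if not query_dir.is_dir():
            continue
        absent = [i for i in EXPECTED_RUNS if not (query_dir / f"run{i}").is_dir()]
        if absent:
            missing[f"{query_dir.parent.name}/{query_dir.name}"] = absent
    return missing

def count_run_dirs(traces_root: Path) -> int:
    """Count run directories; a complete bundle here would have 54 x 5 = 270."""
    return sum(1 for p in traces_root.glob("*/query*/run*") if p.is_dir())
```

An empty result from `missing_runs(Path("traces"))` together with a count of 270 would match the coverage described above.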

Direct link: Click here

Best regards, Vikas

Add TOT-SQL Safeguard submission

842b570

submission: add 5-run traces (270 slots) and reconciled submission JSON

dd0f009

NG-VikasV wants to merge 2 commits into ucbepic:main from NG-VikasV:submit-tot-sql-safeguard
