ganlin770/tool-smith

tool-smith

LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.

The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.

Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.


The result (real, from python -m toolsmith.eval)

Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:

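| metric (all 97)   | base  | LoRA-tuned | Δ     |
|-------------------|-------|------------|-------|
| valid JSON        | 95.9% | 100.0%     | +4.1  |
| schema-valid call | 19.6% | 94.8%      | +75.2 |
| correct tool      | 36.1% | 85.6%      | +49.5 |
| exact args        | 4.1%  | 74.2%      | +70.1 |
| fully correct     | 4.1%  | 74.2%      | +70.1 |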

On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.

[Figure: base vs tuned]

The story is clean and honest: the base 0.5B already knows JSON syntax (95.9% valid) but rarely follows the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT teaches it the schema while leaving the syntax it already had nearly unchanged (95.9% → 100.0% valid JSON).
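The five rows of the table can be read as a cascade of checks on one model output. The sketch below is one plausible way to compute them, not the repository's grade.py: the two-tool `TOOLS` dict, the `grade` function, and the rule that "fully correct" means schema-valid plus exact args are all assumptions made for illustration.

```python
import json

# Hypothetical two-tool schema for illustration only; the real toolbox has
# 8 tools and is defined in toolsmith/schema.py.
TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str},
}

METRICS = ["valid_json", "schema_valid", "correct_tool", "exact_args", "fully_correct"]


def grade(raw: str, gold: dict) -> dict:
    """Score one raw model output against a gold call, one flag per metric."""
    result = dict.fromkeys(METRICS, False)
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not even parseable: every metric stays False
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result
    tool, args = call.get("tool"), call.get("args")
    spec = TOOLS.get(tool) if isinstance(tool, str) else None
    # Schema-valid: known tool, exactly the expected arg names, right types.
    result["schema_valid"] = (
        spec is not None
        and isinstance(args, dict)
        and set(args) == set(spec)
        and all(isinstance(args[name], typ) for name, typ in spec.items())
    )
    # Tool choice is scored independently of whether the args are well-formed.
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = result["correct_tool"] and args == gold["args"]
    result["fully_correct"] = result["schema_valid"] and result["exact_args"]
    return result


gold = {"tool": "get_weather", "args": {"city": "Tokyo"}}
print(grade('{"tool": "get_weather", "args": {"location": "Tokyo"}}', gold))
```

Scoring the tool name separately from schema validity is what lets "correct tool" exceed "schema-valid call" for the base model: it can pick the right tool and still emit the wrong argument names.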

Generalization to hand-written (non-templated) inputs

The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:

| metric (12 hand-written) | base  | tuned  | Δ     |
|--------------------------|-------|--------|-------|
| schema-valid             | 25.0% | 100.0% | +75.0 |
| correct tool             | 41.7% | 83.3%  | +41.6 |
| fully correct            | 8.3%  | 58.3%  | +50.0 |

The lift holds on genuinely out-of-distribution phrasing — tool selection and schema adherence generalize strongly; fully_correct (58.3%) is honestly lower than the templated 74.2%, because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.

The training run (real loss curve)

LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
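The trainable-parameter figure can be sanity-checked by hand: a rank-r LoRA adapter on a linear layer with `in` inputs and `out` outputs adds r × (in + out) weights. The sketch below assumes the layer dimensions from the published Qwen2.5-0.5B config, assumes the adapters sit on the seven projections named in `PROJECTIONS`, and assumes the percentage is trainable / (base + trainable); none of these are confirmed by this repository.

```python
# Assumed Qwen2.5-0.5B dimensions (from the published model config,
# not read from this repository).
HIDDEN = 896               # hidden_size
KV_DIM = 128               # num_key_value_heads (2) * head_dim (64)
INTERMEDIATE = 4864        # intermediate_size
LAYERS = 24                # num_hidden_layers
BASE_PARAMS = 494_032_768  # assumed base model parameter count
RANK = 16                  # LoRA rank stated in the README

# (in_features, out_features) of each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Count LoRA weights: A is (rank x in), B is (out x rank), per projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(RANK)
share = trainable / (BASE_PARAMS + trainable)
print(f"{trainable:,} trainable params, {share:.2%} of all params")
```

Under these assumptions the count comes to 8,798,208 parameters and 1.75%, in line with the figures above. Stored as 32-bit floats, that is also roughly 35 MB, consistent with the adapter size quoted under Limitations.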

[Figure: training loss curve]

Quickstart

pip install -e .                                   # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt              # torch/transformers/peft/trl/... for training

python -m toolsmith.data.build      # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train           # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval            # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo    # offline recovery demo -> logs/run-demo.jsonl

Serve the tuned model over MCP

python -m toolsmith.mcp_server      # stdio; exposes route_to_tool(request) + the 8 tools
# or containerized (installs the inference stack, pulls the base model on first run):
docker build -t tool-smith . && docker run --rm -i tool-smith

mcp.json for Claude Desktop / Cursor:

{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}

route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.

The agent loop (validation + recovery + observability)

agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:

ok=True recovery=none              | What's the weather in Tokyo?        (tuned, 1 attempt)
ok=True recovery=repair_retry      | Pack for Berlin? ... rain there.    (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds...     (base failed x3 -> fallback router)
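The three recovery paths in that trace can be reproduced with stub routers. This is a minimal sketch under stated assumptions, not the code in agent.py: the `Router` signature, the one-tool `REQUIRED_ARGS` schema, the `validate` and `run` helpers, and the default of three attempts before falling back are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical one-tool schema; the real validator lives in toolsmith/schema.py.
REQUIRED_ARGS = {"get_weather": {"city"}}

# A router takes (request, previous_error) and returns a raw output string.
Router = Callable[[str, Optional[str]], str]


def validate(raw: str) -> Optional[str]:
    """Return None if the call is valid, else an error message to feed back."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return "unknown or missing tool"
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != REQUIRED_ARGS[tool]:
        return f"args must be exactly {sorted(REQUIRED_ARGS[tool])}"
    return None


def run(request: str, primary: Router, fallback: Router, max_attempts: int = 3) -> dict:
    """Route, validate, repair-retry with the error fed back, then fall back."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        raw = primary(request, error)
        error = validate(raw)
        if error is None:
            recovery = "none" if attempt == 1 else "repair_retry"
            return {"ok": True, "recovery": recovery, "attempts": attempt}
    # The primary router never produced a valid call: hand over to the fallback.
    error = validate(fallback(request, error))
    return {"ok": error is None, "recovery": "frontier_fallback",
            "attempts": max_attempts + 1}


def flaky(request: str, error: Optional[str]) -> str:
    # Stub: fails on the first try, succeeds once it sees an error message.
    return '{"tool": "get_weather", "args": {"city": "Berlin"}}' if error else "oops"


def rule(request: str, error: Optional[str]) -> str:
    # Stub standing in for the frontier/rule fallback router.
    return '{"tool": "get_weather", "args": {"city": "Tokyo"}}'


print(run("Pack for Berlin?", flaky, rule))
```

Passing the previous validation error back into the router is what distinguishes a repair-retry from a blind retry; a stub like `flaky` makes that path testable without loading a model.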

python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
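A report of this kind boils down to a small aggregation over JSONL records. The sketch below is illustrative only and is not logs_report.py: the `summarize` helper and the field names `ok`, `recovery`, and `latency_ms` are assumptions about the log format.

```python
import json
from collections import Counter
from statistics import mean
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate JSONL agent records into success rate, recovery mix, latency."""
    # Field names ("ok", "recovery", "latency_ms") are hypothetical.
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"runs": 0}
    return {
        "runs": len(records),
        "success_rate": sum(bool(r["ok"]) for r in records) / len(records),
        "recovery": dict(Counter(r["recovery"] for r in records)),
        "mean_latency_ms": mean(r["latency_ms"] for r in records),
    }


sample = [
    '{"ok": true, "recovery": "none", "latency_ms": 100}',
    '{"ok": true, "recovery": "repair_retry", "latency_ms": 300}',
    '{"ok": false, "recovery": "frontier_fallback", "latency_ms": 200}',
]
print(summarize(sample))
```

In practice the lines would come from a file handle, for example `summarize(open(path))` over one of the logs/run-*.jsonl files, once the field names are adjusted to the real format.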

Layout

toolsmith/
  schema.py        # the fixed 8-tool toolbox + JSON validator (one source of truth)
  data/build.py    # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
  train.py         # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
  router.py        # load base (+adapter) and turn a request into a tool-call string
  eval.py          # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
  grade.py         # exact tool/args grading (no LLM judge needed for routing)
  mcp_server.py    # FastMCP: route_to_tool + 8 mock tools
  agent.py         # validate / repair-retry / frontier-fallback loop + JSONL logging
  logs_report.py   # summarize agent runs
data/              # committed train/test jsonl
artifacts/         # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/              # committed real agent traces
tests/             # pytest (grading + agent recovery), model-free

Limitations & next steps

Stated plainly — knowing the limits is part of the work:

  • Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
  • Synthetic data: the 243 train / 97 test cases are templated (though TRAIN/TEST templates are disjoint + a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.
  • Mock tools: the 8 tool bodies are stubs — the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
  • Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs a key).
  • Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
  • Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.

License

MIT
