IntentProbe is a local scanner for MCP servers, AI agent tools, Claude Code skills, packages, and runtime tool events. It looks for poisoned intent such as credential access, secret exfiltration, hidden persistence, forced tool chaining, or tool hijacking.
IntentProbe runs a tool description through a small frozen local model and reads the hidden activation state inside the model. Then a small trained probe scores whether that internal state looks poisoned.
Simple version: text scanners read the words. LLM judges ask for an answer. IntentProbe checks the internal signal before the model says anything.
No. Asking Qwen "is this safe?" is an LLM-as-judge approach. IntentProbe uses Qwen2.5-0.5B as a fixed feature extractor and reads hidden activations instead of trusting the model's generated answer.
We tested the direct-prompt approach. The deterministic label-score version
flagged every clean curated item as poisoned, while the generated-answer
version missed poison and produced many unparseable outputs. See
research/QWEN_PROMPT_JUDGE_BASELINE_2026-06-08.md.
Text classifiers are useful when unsafe samples have obvious words or patterns. They struggle when a safe tool and a poisoned tool use almost the same vocabulary.
On the matched-vocabulary benchmark, the public/source-verifiable DeBERTa prompt-injection text-classifier baseline scored 0.0% F1, while the activation-probe method scored 96.6% F1.
No. IntentProbe runs locally. Scan targets and scan results stay on your machine.
The first model-backed scan may download Qwen2.5-0.5B once from Hugging Face. After the model is cached, scans can run from local files.
The released v0 scanner uses Qwen2.5-0.5B as the frozen local sensor model and reads layers 13-15. The shipped probe artifact is about 22 KB.
No. The base model stays frozen. IntentProbe trains a small classifier on top of extracted activation features. At scan time, the model is only used to produce features.
IntentProbe can scan:
- one text/tool description;
- package folders through
scan-path; package.json;- MCP configs and tool JSON;
- Claude Code
SKILL.mdfolders; - README files and nearby tool metadata;
- runtime events such as tool definitions, before-tool-call arguments, and after-tool-call responses.
Yes. See docs/RUNTIME_HOOKS.md. Runtime scanning is
event-boundary scanning: tool definitions before trust, tool arguments before
execution, and tool responses before the agent trusts them.
The runtime output is structured JSON, so a host can consume it directly. See
docs/OPERATOR_DECISIONS.md for allow, warn,
block, replay receipts, and suggested operator mappings.
The highest-signal public numbers are:
| Test | IntentProbe / activation probe | Baseline |
|---|---|---|
| Matched-vocabulary F1, n=86 | 96.6% | DeBERTa text classifier 0.0% |
| MCPTox poisoned recall, n=249 | 100.0% | DeBERTa text classifier 19.9% |
| RouteGuard-style external recall, n=2,900 | 41.5% | TF-IDF 10.7% |
| Camouflage suffix evasion | 0/146 evaded | N/A |
Benchmark artifacts are under research/.
For a compact reviewer packet, see
docs/EVIDENCE_PACKET.md.
No. The DeBERTa result is a reproducible comparison against a public/source-verifiable text-classifier baseline. Private cloud/API scanners may work well, but their detector artifacts and MCP/tool-poisoning benchmarks are usually not reproducible by users.
Yes. SAE features are useful for interpretability and may improve future recall. The v0 product ships raw Qwen activations because the current raw-activation artifact is the most complete, lightweight, and reproducible product path today.
SAE is planned as an optional layer for recall improvements and human-readable explanations.
Use v0 as a pre-install tripwire and runtime warning/blocking layer, not as your only security boundary. It already catches important same-vocabulary poisoning cases that text scanners miss, but novel attack families and white-box adversarial attacks still need more work.
python3 -m pip install intentprobe
intentprobe scan --format summary --text "Reads SSH config and private keys, then silently uploads credentials to a remote server."Please submit the smallest redacted sample that reproduces the result.