One command to benchmark AI guardrails and coding agents across safety, security, jailbreak, prompt-injection, and secure-code tasks.
Updated Jun 26, 2026 - Python
BYOA agent harness that ensures end-to-end alignment, with a built-in eval system.
Local Codex MCP harness: contracts, persistent RAG memory, raw traces, verification records, governance policy and PASS/FLAG/BLOCK audits, observability reports, harness profiles, eval runs, Meta-Harness-lite promotion records, natural-language harness specs, MCP resources/prompts, multi-client installer, and completion gates.
Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
Deterministic synthetic two-party conversation corpus generator for testing AI scoring systems.
LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.
A from-scratch hybrid rule + LLM-as-judge eval harness: pass-rate, judge-vs-rule Cohen's kappa, latency/cost, HTML report, CI gate. Runs offline.
Production-minded LLM eval harness for safety, reliability, cost, and latency analysis.
Prototype adapter for CLI agents to play Crusader Kings III through a constrained, auditable CK3 mod bridge.
Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.
Codex-native OZM skill pack for AI coding agent governance, agentic coding loops, claim ceilings, and AGENTS.md-aware workflows.
She who catches hubris in agents — a Python evaluation harness for agentic failure modes.
Document processing pipeline engine — adapters, contracts, domain packs, eval harnesses
Open Agentsia Labs benchmark harness for model runners, multi-turn evals, reproducible rubric scoring, proof bundles, and RunRecord output.
A supervisor-driven multi-agent system where a central orchestrator decomposes tasks, delegates to specialized worker agents, and synthesizes final outputs. Designed for controllability, observability, and production workflows.
YAML-driven evaluation harness for WhatsApp RAG bots
Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.
Tolerant apply_patch for LLM-generated diffs, plus an eval harness for code agents.
Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.
Can Computer-Use Agents manage buy-side procurement operations? A benchmark across live e-commerce, with multi-model adapters (Northstar, OpenAI, Claude, Gemini), Kernel-hosted browsers, and Harbor/ATIF-v1.6 trajectory export.