eval-harness

Here are 38 public repositories matching this topic...

Virtue-Research / guard-eval-harness

One command to benchmark AI guardrails and coding agents across safety, security, jailbreak, prompt-injection, and secure-code tasks.

cli benchmark ai-safety guardrails llm-evaluation llm-safety safety-evaluation eval-harness

Updated Jun 26, 2026
Python

joctaTorres / ratchet

BYOA agent harness that ensures alignment end-to-end, with a built-in eval system.

bdd specs byoa evals agent-orchestration agent-workflow agent-harness eval-harness harness-engineering

Updated Jun 28, 2026
TypeScript

chapzin / codex-harness-mcp

Local Codex MCP harness: contracts, persistent RAG memory, raw traces, verification records, governance policy and PASS/FLAG/BLOCK audits, observability reports, harness profiles, eval runs, Meta-Harness-lite promotion records, natural-language harness specs, MCP resources/prompts, multi-client installer, and completion gates.

mcp observability governance codex ai-agents rag model-context-protocol agent-ops codex-cli agent-harness eval-harness skills-sh harness-engineering

Updated May 12, 2026
JavaScript

plaited / agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.

cli typescript grader ai-agents bun jsonl llm-evaluation agent-evaluation unix-pipeline agent-comparison trajectory-capture eval-harness pass-at-k headless-adapter

Updated Jun 17, 2026
TypeScript

ResonantIQ / resonantforge

Deterministic synthetic two-party conversation corpus generator for testing AI scoring systems.

synthetic-data llm-evaluation customer-intelligence eval-harness corpus-generation

Updated Jun 24, 2026
Python

adityaanand0001 / healos-ai-agent

LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.

typescript nextjs postgresql hono bun llm anthropic drizzle-orm eval-harness

Updated May 1, 2026
TypeScript

ganlin770 / promptlab

A from-scratch hybrid rule + LLM-as-judge eval harness: pass-rate, judge-vs-rule Cohen's kappa, latency/cost, HTML report, CI gate. Runs offline.

evaluation ai-agent llm prompt-engineering llm-as-judge eval-harness

Updated Jun 17, 2026
Python

codychampion / llm-eval-workbench

Production-minded LLM eval harness for safety, reliability, cost, and latency analysis.

python evaluation observability model-evaluation ai-safety red-teaming llm-evals eval-harness safety-evals

Updated May 25, 2026
Python

Kleptobyte / AGI-CK3

Prototype adapter for CLI agents to play Crusader Kings III through a constrained, auditable CK3 mod bridge.

prototype ai-agents ck3 crusader-kings-3 eval-harness game-agents

Updated Jun 10, 2026
Python

2830500285 / omni-agent

Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.

Updated May 16, 2026
TypeScript

OZ-50 / ozm-codex-agent-governance-skills

Codex-native OZM skill pack for AI coding agent governance, agentic coding loops, claim ceilings, and AGENTS.md-aware workflows.

Updated May 30, 2026
Python

LueBangs-coder / nemesis-eval

She who catches hubris in agents — a Python evaluation harness for agentic failure modes.

python testing evaluation ai-safety ai-agents llm eval-harness

Updated Jun 27, 2026
Python

qte77 / doc-pipeline-engine

Document processing pipeline engine — adapters, contracts, domain packs, eval harnesses

python pipeline document-processing rag pydantic air-gapped document-ai contract-first llm eval-harness

Updated Jun 28, 2026
Python

agentsia-uk / assay-harness

Open Agentsia Labs benchmark harness for model runners, multi-turn evals, reproducible rubric scoring, proof bundles, and RunRecord output.

benchmark adtech reproducibility llm-evaluation eval-harness frontier-ai proof-bundles open-benchmark assay-adtech assay-harness runrecord multi-turn-evals iab-tech-lab

Updated Jun 27, 2026
TypeScript

Zhenwu-C-Wang / agent-orchestrator

A supervisor-driven multi-agent system where a central orchestrator decomposes tasks, delegates to specialized worker agents, and synthesizes final outputs. Designed for controllability, observability, and production workflows.

python multi-agent-systems local-first streamlit ollama structured-outputs agent-orchestration eval-harness

Updated Jun 17, 2026
Python

sarteta / whatsapp-rag-eval-kit

YAML-driven evaluation harness for WhatsApp RAG bots

python yaml ai twilio chatbot evaluation whatsapp observability rag llm eval-harness

Updated Apr 29, 2026
Python

KarmaEnchanter / mental-health-llm-eval

Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.

psychology cbt ai-safety conversational-ai clinical-ai cohen-kappa ollama llm-evaluation llm-as-judge mental-health-ai ai-eval inter-rater-reliability eval-harness lifeline-988 open-source-eval

Updated May 29, 2026
Python

Judysonnen / patchwise

Tolerant apply_patch for LLM-generated diffs, plus an eval harness for code agents.

code-generation llm-agents eval-harness apply-patch

Updated May 6, 2026
Python

shashidharReddy866 / llm-evaluation-system

Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.

nlp json-schema nextjs model-evaluation hono structured-output few-shot-learning ai-evaluation prompt-engineering anthropic llm-evaluation hallucination-detection llm-reliability eval-harness prompt-comparison

Updated May 1, 2026
TypeScript

neverSettles / opencua_hackathon

Can Computer-Use Agents manage buy-side procurement operations? A benchmark across live e-commerce, with multi-model adapters (Northstar, OpenAI, Claude, Gemini), Kernel-hosted browsers, and Harbor/ATIF-v1.6 trajectory export.

benchmark kernel procurement cua lightcone computer-use-agent eval-harness tzafon-northstar openai-computer-use

Updated May 10, 2026
Python
