How Well Do Agentic Skills Work in the Wild

Code and data for "How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings".

Overview

This repository provides a framework for evaluating skill utility under progressively realistic conditions: from hand-curated skills to retrieval from a noisy pool of 34k real-world skills.

Key findings:

Skill benefits degrade consistently as settings become more realistic
Agents struggle to select, retrieve, and adapt skills
Query-specific refinement substantially recovers lost performance

We evaluate on SkillsBench and Terminal-Bench 2.0 across three models: Claude Opus 4.6, Kimi K2.5, and Qwen3.5-397B.

Setup

git clone https://github.com/UCSB-NLP-Chang/Skill-Usage.git
cd Skill-Usage

# Create environment
conda create -n skillusage python=3.12 -y
conda activate skillusage

# Install dependencies
pip install uv huggingface_hub pyyaml numpy
uv tool install --force harbor@git+https://github.com/yujianll/harbor@main

Requirements:

Docker (for running tasks in isolated containers)
API access for the model you want to evaluate

Copy .env.example to .envrc and fill in your keys. Use direnv to load them automatically.

Repository Structure

.
├── tasks/                      # SkillsBench base tasks (84 tasks)
├── terminal-bench-2/           # Terminal-Bench 2.0 base tasks (89 tasks)
├── skills/                     # 34k real-world skill collection (download from HF)
├── search_server/              # Skill search engine (BM25 + semantic + hybrid)
│   └── index/                  # Pre-built search index (download from HF)
├── scripts/                    # Experiment scripts
│   ├── prepare_experiment.py           # Download data & generate configs
│   ├── retrieve_skills.py              # Agentic skill retrieval
│   ├── query_specific_refinement.py    # Query-specific skill refinement
│   ├── query_agnostic_refinement.py    # Query-agnostic skill refinement
│   ├── calculate_results.py            # Aggregate results
│   ├── check_skill_usage.py            # Analyze skill usage in trajectories
│   └── ...
├── experiments/configs/        # Harbor run configurations
└── data/                       # Retrieval data and mappings

Option 1: Reproduce Paper Results

Use scripts/prepare_experiment.py to download pre-computed experiment data from our HuggingFace dataset, build task folders, and generate a Harbor config — all in one command.

python scripts/prepare_experiment.py \
    --benchmark skillsbench \
    --setting curated \
    --model claude

The script prints the Harbor command to run at the end, e.g.:

harbor run -c experiments/configs/generated/skillsbench/curated-claude.yaml

Available Settings

SkillsBench (from most idealized to most realistic):

Setting	Description
`force_loaded`	All curated skills force-loaded into context (upper bound)
`curated`	Original curated skills, agent decides which to load
`curated_w_distractors`	Curated skills + distractor skills from 34k pool
`no_skills`	No skills provided (baseline)
`retrieved_w_curated`	Agent retrieves top-5 from 34k pool (curated included)
`retrieved_wo_curated`	Agent retrieves top-5 from 34k pool (curated removed)
`retrieved_w_curated_query_specific`	Retrieved + query-specific refinement
`retrieved_w_curated_query_agnostic`	Retrieved + query-agnostic refinement
`retrieved_wo_curated_query_specific`	Retrieved (w/o curated) + query-specific refinement
`retrieved_wo_curated_query_agnostic`	Retrieved (w/o curated) + query-agnostic refinement

Terminal-Bench 2.0:

Setting	Description
`base`	No skills (baseline)
`retrieved`	Agent retrieves top-5 from 34k pool
`retrieved_query_specific`	Retrieved + query-specific refinement
`retrieved_query_agnostic`	Retrieved + query-agnostic refinement

Models: claude (Claude Code), kimi (Terminus-2), qwen (Qwen-Code)

Additional Options

python scripts/prepare_experiment.py --help

Flag	Default	Description
`--output-dir`	auto	Output directory for prepared tasks
`--config-dir`	`experiments/configs/generated`	Where to write the config YAML
`--concurrency`	8	Number of concurrent trials
`--port`	30000	Port for local model API server (kimi/qwen)

Calculate Results

After a Harbor run completes:

# Single job
python scripts/calculate_results.py skillsbench-trajectories/jobs/<job_name>

# All jobs in a folder
python scripts/calculate_results.py skillsbench-trajectories/jobs --batch

Option 2: Full Pipeline Walkthrough

Run the complete pipeline yourself — retrieval, task completion, refinement, and evaluation — instead of using pre-computed data.

Step 1: Download Skills and Start the Search Server

The search server provides keyword, semantic, and hybrid search over the 34k skill collection.

Download the skill collection and pre-built search index:

# Download skills and metadata (needed for /detail endpoint and building index from scratch)
hf download Shiyu-Lab/Skill-Usage skills-34k/skills.zip skills-34k/skills_meta.jsonl --repo-type dataset --local-dir .
unzip skills-34k/skills.zip -d skills/
cp skills-34k/skills_meta.jsonl skills/

# Download pre-built search index
hf download Shiyu-Lab/Skill-Usage search_index/search_index.zip --repo-type dataset --local-dir .
unzip search_index/search_index.zip -d search_server/index/

Using the pre-built index is preferred. Alternatively, you can build the index from scratch (requires a running vLLM instance serving Qwen/Qwen3-Embedding-4B and the full skill collection):

python search_server/index_builder.py

Install search server dependencies and start:

pip install -r search_server/requirements.txt
python search_server/http_server.py --include-content

Server arguments:

Flag	Default	Description
`--port`	8742	Port to serve on
`--model`	`Qwen/Qwen3-Embedding-4B`	Embedding model for semantic search (requires vLLM)
`--exclude-gt`	false	Use indexes built without ground-truth skills
`--include-content`	false	Include full SKILL.md content in semantic search
`--content-weight`	0.05	Blend weight for content vs metadata embeddings

Endpoints:

GET /keyword?q=...&top_k=10 — BM25 keyword search
GET /semantic?q=...&top_k=10 — Dense embedding search
GET /hybrid?q=...&top_k=10 — Reciprocal rank fusion of keyword + semantic
GET /detail/{skill_id} — Full SKILL.md content for a given skill

Step 2: Retrieve Skills for Tasks

First, copy the base tasks into a working directory:

# For SkillsBench
cp -r tasks skillsbench-retrieved-claude

# For Terminal-Bench 2.0
cp -r terminal-bench-2 terminalbench-retrieved-claude

Use agentic search where an agent iteratively formulates queries, inspects candidates, and selects the most relevant skills. The --tasks-dir argument points to the working task directory created above.

# Prepare Docker containers for skill retrieval
python scripts/retrieve_skills.py prepare \
    --tasks-dir skillsbench-retrieved-claude \
    --agent claude-code \
    --model claude-opus-4-6 \
    --max-workers 5

# For Kimi:  --agent terminus-2 --model openai/moonshotai/Kimi-K2.5
# For Qwen:  --agent qwen-coder --model Qwen/Qwen3.5-397B-A17B-FP8

# Run retrieval (config path is printed by the prepare step)
harbor run -c <config_path>

# Collect results — each task gets a found_skills.json with skill IDs sorted by relevance
python scripts/retrieve_skills.py collect \
    --tasks-dir skillsbench-retrieved-claude \
    --keep-temp

Then, copy the actual skill files into each task's environment/skills/ using select_top_k_skills.py. This reads each task's found_skills.json, copies the top-k skills from the skills/ directory, and produces a new task directory ready to run:

# Select top-5 skills per task (creates skillsbench-retrieved-claude-5/)
python scripts/select_top_k_skills.py 5 --tasks-dir skillsbench-retrieved-claude

For Terminal-Bench 2.0 tasks, you also need to rewrite the Dockerfiles to include skill COPY lines. Terminal-Bench tasks use prebuilt Docker images specified in task.toml, which bypasses the Dockerfile. This script removes the prebuilt image from task.toml and rewrites the Dockerfile to use it as the base with skills added on top:

python scripts/rewrite_dockerfile.py terminalbench-retrieved-claude-5

Step 3: Run Tasks

The output directory (e.g., skillsbench-retrieved-claude-5/) now contains tasks with skills in environment/skills/. To run the benchmark, use one of the example configs in experiments/configs/ and update the datasets.path field to point to your task directory:

# Edit the config to set your task directory path
# Then run:
harbor run -c <your_config>.yaml

Base configs are at experiments/configs/{skillsbench,terminalbench}/{claude,kimi,qwen}.yaml.

Step 4: Refine Skills (Optional)

Note: Both refinement scripts use .local-workspace/ for temp files and jobs by default. To run query-specific and query-agnostic refinement concurrently, use --workspace to separate them, e.g., --workspace .workspace-specific and --workspace .workspace-agnostic.

Query-Specific Refinement

The agent explores the target task, attempts a solution using retrieved skills, reflects on what helped or hurt, and composes improved skills. The --tasks-dir should point to a task directory that already has skills (e.g., from Step 2).

# Prepare (uses the task's own Dockerfile so the agent can test skills)
python scripts/query_specific_refinement.py prepare \
    --tasks-dir skillsbench-retrieved-claude-5 \
    --agent claude-code \
    --model claude-opus-4-6
# For Kimi:  --agent terminus-2 --model openai/moonshotai/Kimi-K2.5
# For Qwen:  --agent qwen-coder --model Qwen/Qwen3.5-397B-A17B-FP8

# Run refinement
harbor run -c <config_path>

# Copy task directory before collecting (collect overwrites environment/skills/)
cp -r skillsbench-retrieved-claude-5 skillsbench-retrieved-claude-5-query-specific

# Collect refined skills into the copy
python scripts/query_specific_refinement.py collect \
    --tasks-dir skillsbench-retrieved-claude-5-query-specific \
    --keep-temp

Query-Agnostic Refinement

Each skill is improved independently using the skill-creator methodology, without knowledge of the target task:

# Prepare per-skill refinement containers
python scripts/query_agnostic_refinement.py prepare \
    --tasks-dir skillsbench-retrieved-claude-5 \
    --agent claude-code \
    --model claude-opus-4-6
# For Kimi:  --agent terminus-2 --model openai/moonshotai/Kimi-K2.5
# For Qwen:  --agent qwen-coder --model Qwen/Qwen3.5-397B-A17B-FP8

# Run refinement
harbor run -c <config_path>

# Copy task directory before collecting (collect overwrites environment/skills/)
cp -r skillsbench-retrieved-claude-5 skillsbench-retrieved-claude-5-query-agnostic

# Collect and distribute refined skills into the copy
python scripts/query_agnostic_refinement.py collect \
    --tasks-dir skillsbench-retrieved-claude-5-query-agnostic \
    --keep-temp

Step 5: Evaluate

# Aggregate pass rates (mean +/- std across runs)
python scripts/calculate_results.py skillsbench-trajectories/jobs/<job_name>

# Analyze skill usage in agent trajectories
python scripts/check_skill_usage.py skillsbench-trajectories/jobs/<job_name>

# Judge skill coverage with an LLM (requires OpenAI-compatible API)
python scripts/judge_skill_coverage.py <tasks-dir> --model gpt-5.4

Skill Collection

The skill collection contains 34,198 real-world skills sourced from skillhub.club and skills.sh, filtered by permissive licenses (MIT and Apache 2.0), deduplicated, and cleaned. Each skill is a folder containing a SKILL.md file with structured metadata and content.

To download:

hf download Shiyu-Lab/Skill-Usage skills-34k/skills.zip --repo-type dataset --local-dir .
unzip skills-34k/skills.zip -d skills/

Citation

@misc{liu2026agenticskillsworkwild,
    title={How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings}, 
    author={Yujian Liu and Jiabao Ji and Li An and Tommi Jaakkola and Yang Zhang and Shiyu Chang},
    year={2026},
    eprint={2604.04323},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2604.04323}, 
}

Acknowledgments

This work builds on and is grateful to the following projects:

SkillsBench — the benchmark for evaluating how well AI agents use skills
Terminal-Bench 2.0 — a benchmark for hard, realistic tasks in command line interfaces
Harbor — the framework for evaluating and optimizing agents in container environments

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.claude/skills		.claude/skills
data		data
experiments/configs		experiments/configs
scripts		scripts
search_server		search_server
tasks		tasks
terminal-bench-2		terminal-bench-2
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How Well Do Agentic Skills Work in the Wild

Overview

Setup

Repository Structure

Option 1: Reproduce Paper Results

Available Settings

Additional Options

Calculate Results

Option 2: Full Pipeline Walkthrough

Step 1: Download Skills and Start the Search Server

Step 2: Retrieve Skills for Tasks

Step 3: Run Tasks

Step 4: Refine Skills (Optional)

Query-Specific Refinement

Query-Agnostic Refinement

Step 5: Evaluate

Skill Collection

Citation

Acknowledgments

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

How Well Do Agentic Skills Work in the Wild

Overview

Setup

Repository Structure

Option 1: Reproduce Paper Results

Available Settings

Additional Options

Calculate Results

Option 2: Full Pipeline Walkthrough

Step 1: Download Skills and Start the Search Server

Step 2: Retrieve Skills for Tasks

Step 3: Run Tasks

Step 4: Refine Skills (Optional)

Query-Specific Refinement

Query-Agnostic Refinement

Step 5: Evaluate

Skill Collection

Citation

Acknowledgments

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages