# Meta-Harness Optimization
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Meta-Harness is a framework for automated end-to-end search over **model harnesses** — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.
**Paper**: [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052)
**Homepage**: https://yoonholee.com/meta-harness/
## Core Concepts
| Term | Meaning |
|------|---------|
| **Harness** | All code around the base model: memory, retrieval, prompt construction, tool use |
| **Proposer Agent** | LLM (e.g. Claude Code) that proposes new harness variants |
| **Evaluator** | Runs proposed harnesses on a benchmark, returns a score |
| **Meta-Loop** | Iterative propose → evaluate → feedback cycle |
## Installation
Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:
```bash
# Run from the repository root; each subshell returns you to it afterwards.
# Text classification experiment
(cd reference_examples/text_classification && uv sync)
# Terminal-Bench 2 experiment
(cd reference_examples/terminal_bench_2 && uv sync)
```
No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.
## Quick Start
### Text Classification (Memory System Search)
```bash
cd reference_examples/text_classification
# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1
# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```
### Terminal-Bench 2 (Scaffold Evolution)
```bash
cd reference_examples/terminal_bench_2
# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
# General eval format:
# run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```
## Applying Meta-Harness to a New Domain
The recommended workflow uses the onboarding document with your AI coding assistant:
1. Open `ONBOARDING.md` in your coding assistant (Claude Code, Cursor, etc.) and have a conversation about your domain. This produces `domain_spec.md`.
2. `domain_spec.md` will contain:
   - What the harness controls in your domain
   - How to evaluate harness quality (benchmark / metric)
   - What the proposer agent should modify
   - Constraints and budget considerations
### Minimum Required Components for a New Domain
```
my_domain/
├── pyproject.toml # uv-managed dependencies
├── domain_spec.md # generated via ONBOARDING.md conversation
├── meta_harness.py # main optimization loop
├── harness.py # base harness implementation
├── evaluator.py # benchmark runner → numeric score
└── claude_wrapper.py # proposer agent wrapper
```
## Implementing a Harness
A harness wraps a base model and manages context/memory/tools:
```python
# harness.py: minimal harness structure
from dataclasses import dataclass


@dataclass
class HarnessConfig:
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"  # "last_k", "all", or anything else for no memory
    k: int = 5  # counts messages, not turns; an odd k can start the window on an assistant message
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."


def call_model(model: str, system: str, messages: list[dict]) -> str:
    """Placeholder for the base-model call, which this snippet does not define.

    Replace the body with your provider's client call and return the reply text.
    """
    raise NotImplementedError("connect call_model to your model provider")


class AgentHarness:
    def __init__(self, config: HarnessConfig):
        self.config = config
        self.memory: list[dict] = []

    def reset(self):
        self.memory = []

    def _build_context(self, new_input: str) -> list[dict]:
        """Core harness logic: what does the model see?"""
        if self.config.memory_strategy == "last_k":
            # Guard k <= 0: self.memory[-0:] would return the whole list.
            k = self.config.k
            recent = self.memory[-k:] if k > 0 else []
        elif self.config.memory_strategy == "all":
            recent = self.memory[:]
        else:
            recent = []
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input: str) -> str:
        messages = self._build_context(user_input)
        # Call base model with constructed context
        response = call_model(
            model=self.config.model,
            system=self.config.system_prompt,
            messages=messages,
        )
        # Update memory
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response
```
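To see what the `last_k` strategy shows the model, the windowing logic can be pulled out into a pure function and exercised without any model call. This is a minimal sketch: `build_context` is a hypothetical standalone mirror of `_build_context`, not part of the shipped code. It also guards `k <= 0`, since Python's `memory[-0:]` returns the whole list rather than an empty one.

```python
def build_context(memory: list[dict], new_input: str,
                  strategy: str = "last_k", k: int = 5) -> list[dict]:
    """Standalone mirror of AgentHarness._build_context (illustrative only)."""
    if strategy == "last_k":
        # memory[-0:] is the full list, so treat k <= 0 as "no memory".
        recent = memory[-k:] if k > 0 else []
    elif strategy == "all":
        recent = list(memory)
    else:
        recent = []
    return recent + [{"role": "user", "content": new_input}]


# Three completed turns produce six stored messages.
memory: list[dict] = []
for turn in range(3):
    memory.append({"role": "user", "content": f"question {turn}"})
    memory.append({"role": "assistant", "content": f"answer {turn}"})

window = build_context(memory, "question 3", strategy="last_k", k=2)
print([m["content"] for m in window])
# ['question 2', 'answer 2', 'question 3']
```

Because `k` counts messages rather than turns, an even `k` keeps whole user/assistant pairs in the window.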
## Implementing the Evaluator
```python
# evaluator.py: runs a harness on a benchmark and returns a score
from harness import AgentHarness, HarnessConfig


def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
    """
    Evaluate a harness configuration on a dataset.
    Returns a scalar score (higher is better).
    """
    if not dataset:
        # Avoid a ZeroDivisionError on an empty or missing dataset.
        raise ValueError("dataset must contain at least one example")
    harness = AgentHarness(config)
    correct = 0
    for example in dataset:
        harness.reset()  # every example starts with empty memory
        prediction = harness.step(example["input"])
        if grade(prediction, example["label"]):
            correct += 1
    return correct / len(dataset)


def grade(prediction: str, label: str) -> bool:
    """Task-specific grading logic: case-insensitive substring match."""
    return label.lower().strip() in prediction.lower()
```
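The substring rule in `grade` is permissive: a short label can match inside an unrelated word. The sketch below contrasts it with a whole-word variant. `grade_whole_word` is a hypothetical alternative, not part of the reference code, and the right rule depends on your task's label set.

```python
import re


def grade_substring(prediction: str, label: str) -> bool:
    """Same rule as grade(): case-insensitive substring match."""
    return label.lower().strip() in prediction.lower()


def grade_whole_word(prediction: str, label: str) -> bool:
    """Hypothetical stricter variant: the label must appear as a whole word."""
    pattern = r"\b" + re.escape(label.lower().strip()) + r"\b"
    return re.search(pattern, prediction.lower()) is not None


# "no" occurs inside "unknown", so the substring rule reports a false positive.
print(grade_substring("The answer is unknown.", "no"))   # True
print(grade_whole_word("The answer is unknown.", "no"))  # False
print(grade_whole_word("Final label: No", "no"))         # True
```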
## The Meta-Harness Loop
```python
# meta_harness.py: the optimization loop
from pathlib import Path
from typing import Optional

from evaluator import evaluate_harness
from claude_wrapper import run_proposer


def meta_harness_loop(
    iterations: int = 10,
    train_dataset: Optional[list[dict]] = None,
    val_dataset: Optional[list[dict]] = None,
):
    history: list[dict] = []
    best_score = 0.0
    best_config = None
    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")
        # 1. Propose: ask the proposer agent for a new harness variant
        proposal = run_proposer(
            history=history,
            task_description="Optimize the memory system for text classification.",
            code_context=Path("harness.py").read_text(),
        )
        # 2. Evaluate: run the proposed harness
        new_config = None  # reset so a failed parse never reuses a stale config
        try:
            # parse_proposal is not defined in this snippet; it must turn the
            # proposer's text into a HarnessConfig.
            new_config = parse_proposal(proposal)
            score = evaluate_harness(new_config, train_dataset)
        except Exception as e:
            score = 0.0
            print(f"Evaluation failed: {e}")
        # 3. Record: log result for proposer feedback
        record = {
            "iteration": i + 1,
            "proposal": proposal,
            "score": score,
        }
        history.append(record)
        print(f"Score: {score:.4f}")
        if new_config is not None and score > best_score:
            best_score = score
            best_config = new_config
            print(f"New best: {best_score:.4f}")
    # Final validation on held-out set
    if best_config is not None and val_dataset:
        val_score = evaluate_harness(best_config, val_dataset)
        print(f"\nFinal val score: {val_score:.4f}")
    return best_config, history
```
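The loop calls `parse_proposal`, which the snippet above does not define. One possible shape is sketched here under a loud assumption: the proposer is instructed to reply with a single JSON object whose keys are `HarnessConfig` field names. That reply format is hypothetical and is not taken from the reference implementation; `HarnessConfig` is copied locally only so the sketch runs on its own.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class HarnessConfig:  # local copy of the dataclass from harness.py
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."


def parse_proposal(proposal: str) -> HarnessConfig:
    """Build a HarnessConfig from a JSON object embedded in the reply (assumed format)."""
    # Take everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", proposal, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in proposal")
    raw = json.loads(match.group(0))  # raises ValueError on malformed JSON
    known = {f.name for f in fields(HarnessConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return HarnessConfig(**raw)  # omitted keys fall back to the defaults


reply = 'I suggest a wider window: {"memory_strategy": "last_k", "k": 8}'
config = parse_proposal(reply)
print(config.memory_strategy, config.k)
# last_k 8
```

Rejecting unknown keys makes a malformed proposal fail fast, so the loop records a score of 0.0 for it instead of silently evaluating the defaults.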
## Proposer Agent Wrapper (Claude Code)
The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py` to your own proposer; the example below calls the Anthropic API directly:
```python
# claude_wrapper.py: wraps proposer agent calls
import json

import anthropic


def run_proposer(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    """
    Call Claude Code (or another proposer) to suggest harness modifications.
    Logs all interactions for reproducibility.
    """
    prompt = build_proposer_prompt(history, task_description, code_context)
    # Example: call Claude via API
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.content[0].text
    # Log for reproducibility
    log_entry = {"prompt": prompt, "response": result}
    with open("proposer_log.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
    return result
```
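`run_proposer` relies on a `build_proposer_prompt` helper whose body is not shown here. The following is a minimal sketch of what such a helper might look like, assuming history records carry the `iteration` and `score` keys written by the meta-loop. The prompt wording and the `max_history` parameter are hypothetical choices, not the reference implementation.

```python
def build_proposer_prompt(
    history: list[dict],
    task_description: str,
    code_context: str,
    max_history: int = 5,  # hypothetical cap on how many past attempts to show
) -> str:
    """Assemble the task, the current harness code, and recent scored attempts."""
    recent = history[-max_history:] if max_history > 0 else []
    lines = [
        f"- iteration {r['iteration']}: score {r['score']:.4f}" for r in recent
    ]
    history_block = "\n".join(lines) if lines else "(no previous attempts)"
    return (
        f"Task: {task_description}\n\n"
        f"Current harness code:\n{code_context}\n\n"
        f"Previous attempts (most recent last):\n{history_block}\n\n"
        "Propose one improved harness configuration."
    )


example_history = [
    {"iteration": 1, "proposal": "...", "score": 0.5},
    {"iteration": 2, "proposal": "...", "score": 0.75},
]
prompt = build_proposer_prompt(
    example_history,
    task_description="Optimize the memory system for text classification.",
    code_context="class AgentHarness: ...",
)
print(prompt)
```

Capping the history keeps the prompt size bounded as iterations accumulate; whether the proposer benefits more from a short recent window or the full log is a design choice to test in your domain.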