meta-harness-optimization

Framework for automated search over task-specific model harnesses — the code around a fixed base model that decides what to store, retrieve, and show while the model works.

# Meta-Harness Optimization

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Meta-Harness is a framework for automated end-to-end search over **model harnesses** — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.

**Paper**: [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052)  
**Homepage**: https://yoonholee.com/meta-harness/

## Core Concepts

| Term | Meaning |
|------|---------|
| **Harness** | All code around the base model: memory, retrieval, prompt construction, tool use |
| **Proposer Agent** | LLM (e.g. Claude Code) that proposes new harness variants |
| **Evaluator** | Runs proposed harnesses on a benchmark, returns a score |
| **Meta-Loop** | Iterative propose → evaluate → feedback cycle |

## Installation

Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:

```bash
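# Paths are relative to the repository root; the subshells keep each cd
# from affecting the next command.

# Text classification experiment
(cd reference_examples/text_classification && uv sync)

# Terminal-Bench 2 experiment
(cd reference_examples/terminal_bench_2 && uv sync)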
```

No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.

## Quick Start

### Text Classification (Memory System Search)

```bash
cd reference_examples/text_classification

# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1

# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```

### Terminal-Bench 2 (Scaffold Evolution)

```bash
cd reference_examples/terminal_bench_2

# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf

# General eval format:
# run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```

## Applying Meta-Harness to a New Domain

The recommended workflow uses the onboarding document with your AI coding assistant:

1. Open `ONBOARDING.md` in your coding assistant (Claude Code, Cursor, etc.) and have a conversation about your domain. This produces `domain_spec.md`.
2. `domain_spec.md` will contain:
   - What the harness controls in your domain
   - How to evaluate harness quality (benchmark / metric)
   - What the proposer agent should modify
   - Constraints and budget considerations

### Minimum Required Components for a New Domain

```
my_domain/
├── pyproject.toml          # uv-managed dependencies
├── domain_spec.md          # generated via ONBOARDING.md conversation
├── meta_harness.py         # main optimization loop
├── harness.py              # base harness implementation
├── evaluator.py            # benchmark runner → numeric score
└── claude_wrapper.py       # proposer agent wrapper
```
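
Before starting a run it can help to confirm that this layout is complete. The sketch below is a hypothetical helper (`missing_components` is not part of Meta-Harness) that checks a domain directory for the six files listed in the tree above.

```python
from pathlib import Path

# File names taken from the directory tree above.
REQUIRED_FILES = (
    "pyproject.toml",
    "domain_spec.md",
    "meta_harness.py",
    "harness.py",
    "evaluator.py",
    "claude_wrapper.py",
)


def missing_components(domain_dir: str) -> list[str]:
    """Return the required file names that are absent from domain_dir."""
    root = Path(domain_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "my_domain" is the placeholder directory name used in the tree above.
    missing = missing_components("my_domain")
    print("missing:", missing or "nothing")
```

An empty list means every expected file is present; anything else names what still needs to be written.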

## Implementing a Harness

A harness wraps a base model and manages context/memory/tools:

```python
# harness.py — minimal harness structure
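from dataclasses import dataclass

import anthropic


def call_model(model: str, system: str, messages: list[dict]) -> str:
    """Assumed helper (not defined in the original): one Anthropic Messages API call."""
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
    response = client.messages.create(
        model=model,
        max_tokens=1024,  # assumed limit; tune for your task
        system=system,
        messages=messages,
    )
    return response.content[0].text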

@dataclass
class HarnessConfig:
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."

class AgentHarness:
    def __init__(self, config: HarnessConfig):
        self.config = config
        self.memory: list[dict] = []

    def reset(self):
        self.memory = []

    def _build_context(self, new_input: str) -> list[dict]:
        """Core harness logic: what does the model see?"""
        if self.config.memory_strategy == "last_k":
            recent = self.memory[-self.config.k:]
        elif self.config.memory_strategy == "all":
            recent = self.memory[:]
        else:
            recent = []
        
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input: str) -> str:
        messages = self._build_context(user_input)
        # Call base model with constructed context
        response = call_model(
            model=self.config.model,
            system=self.config.system_prompt,
            messages=messages
        )
        # Update memory
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response
```
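
One detail of `last_k` is worth checking. `memory` holds alternating user and assistant messages, so slicing by message count with an odd `k` (the default is 5) yields a context that opens with an assistant message once enough history has accumulated. Separately, `k = 0` returns the whole list, because `memory[-0:]` is the same as `memory[0:]`. The sketch below counts turns instead of messages; `last_k_turns` is a hypothetical helper, not part of the shipped harness.

```python
def last_k_turns(memory: list[dict], k: int) -> list[dict]:
    """Keep the last k (user, assistant) pairs from an alternating message list.

    Assumes memory is [user, assistant, user, assistant, ...], as built by
    AgentHarness.step above.
    """
    if k <= 0:
        return []  # avoid memory[-0:], which would return everything
    return memory[-2 * k:]


# Three completed turns of history.
memory = []
for i in range(3):
    memory.append({"role": "user", "content": f"q{i}"})
    memory.append({"role": "assistant", "content": f"a{i}"})

# Message-based slice with k=5 starts on an assistant message.
print([m["content"] for m in memory[-5:]])  # ['a0', 'q1', 'a1', 'q2', 'a2']

# Turn-based slice with k=2 starts on a user message.
print([m["content"] for m in last_k_turns(memory, 2)])  # ['q1', 'a1', 'q2', 'a2']
```

Whether a leading assistant message matters depends on the chat API you call, so treat this as one option rather than a required change.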

## Implementing the Evaluator

```python
# evaluator.py — runs harness on benchmark, returns score
from harness import AgentHarness, HarnessConfig

def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
    """
    Evaluate a harness configuration on a dataset.
    Returns a scalar score (higher is better).
    """
    harness = AgentHarness(config)
    correct = 0
    
    for example in dataset:
        harness.reset()
        prediction = harness.step(example["input"])
        if grade(prediction, example["label"]):
            correct += 1
    
    # Guard against an empty dataset to avoid ZeroDivisionError
    return correct / len(dataset) if dataset else 0.0

def grade(prediction: str, label: str) -> bool:
    """Task-specific grading logic."""
    return label.lower().strip() in prediction.lower()
```
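
Substring matching, as in `grade` above, is permissive: a short label such as `no` also matches inside `unknown`. If that matters for your label set, a stricter variant can match on word boundaries. `grade_whole_word` below is an illustrative alternative, not the grader used in the reference experiments, and it assumes labels begin and end with alphanumeric characters.

```python
import re


def grade_substring(prediction: str, label: str) -> bool:
    """Same logic as grade() above: case-insensitive substring match."""
    return label.lower().strip() in prediction.lower()


def grade_whole_word(prediction: str, label: str) -> bool:
    """Match the label only as a whole word or phrase, case-insensitively."""
    pattern = r"\b" + re.escape(label.strip()) + r"\b"
    return re.search(pattern, prediction, flags=re.IGNORECASE) is not None


print(grade_substring("The answer is unknown.", "no"))   # True (false positive)
print(grade_whole_word("The answer is unknown.", "no"))  # False
print(grade_whole_word("Answer: No", "no"))              # True
```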

## The Meta-Harness Loop

```python
# meta_harness.py — the optimization loop
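from pathlib import Path

from claude_wrapper import run_proposer
from evaluator import evaluate_harness

# NOTE: parse_proposal (proposal text -> HarnessConfig) is not shown in this
# skill; define or import it before running the loop.


def meta_harness_loop(
    iterations: int = 10,
    train_dataset: list[dict] | None = None,
    val_dataset: list[dict] | None = None,
):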
    history: list[dict] = []
    best_score = 0.0
    best_config = None

    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")

        # 1. Propose: ask the proposer agent for a new harness variant
        proposal = run_proposer(
            history=history,
            task_description="Optimize the memory system for text classification.",
            code_context=Path("harness.py").read_text(),
        )

        # 2. Evaluate: run the proposed harness
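        new_config = None  # stays None if parsing fails
        try:
            new_config = parse_proposal(proposal)
            score = evaluate_harness(new_config, train_dataset)
        except Exception as e:
            # A failed proposal scores 0.0 so the search can continue
            score = 0.0
            print(f"Evaluation failed: {e}")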

        # 3. Record: log result for proposer feedback
        record = {
            "iteration": i + 1,
            "proposal": proposal,
            "score": score,
        }
        history.append(record)
        print(f"Score: {score:.4f}")

        if score > best_score:
            best_score = score
            best_config = new_config
            print(f"New best: {best_score:.4f}")

    # Final validation on held-out set
    if best_config and val_dataset:
        val_score = evaluate_harness(best_config, val_dataset)
        print(f"\nFinal val score: {val_score:.4f}")

    return best_config, history
```
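
The loop calls `parse_proposal`, which this skill does not show. As a rough sketch, assuming the proposer is instructed to reply with a single JSON object of `HarnessConfig` fields (an assumption; the reference experiments may exchange harness code instead), a parser could look like the following. The `HarnessConfig` copy mirrors the dataclass from `harness.py` so the block runs on its own.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class HarnessConfig:  # mirrors harness.py so this sketch is self-contained
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."


def parse_proposal(proposal: str) -> HarnessConfig:
    """Hypothetical parser: read the outermost {...} in the reply as JSON."""
    start, end = proposal.find("{"), proposal.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in proposal")
    data = json.loads(proposal[start:end + 1])  # JSONDecodeError is a ValueError

    # Reject fields the harness does not know about instead of ignoring them.
    allowed = {f.name for f in fields(HarnessConfig)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    return HarnessConfig(**data)


reply = 'Try a longer window:\n{"memory_strategy": "last_k", "k": 8}'
config = parse_proposal(reply)
print(config.memory_strategy, config.k)  # last_k 8
```

Raising `ValueError` on malformed replies fits the loop above, whose `except` branch records a score of 0.0 and moves on to the next iteration.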

## Proposer Agent Wrapper (Claude Code)

The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py` to your own proposer; the example below calls the Anthropic API directly:

```python
# claude_wrapper.py — wraps proposer agent calls
import json

def run_proposer(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    """
    Call Claude Code (or another proposer) to suggest harness modifications.
    Logs all interactions for reproducibility.
    """
    prompt = build_proposer_prompt(history, task_description, code_context)
    
    # Example: call Claude via API
    import anthropic
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    
    result = response.content[0].text
    
    # Log for reproducibility
    log_entry = {"prompt": prompt, "response": result}
    with open("proposer_log.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
    
    return result

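def build_proposer_prompt(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    # NOTE: the original definition is truncated in the source. This body is
    # an assumed stand-in that shows the proposer the task, the current
    # harness code, and the scores of earlier iterations.
    past = "\n".join(
        f"Iteration {r['iteration']}: score={r['score']:.4f}" for r in history
    )
    return (
        f"Task: {task_description}\n\n"
        f"Current harness code:\n{code_context}\n\n"
        f"Previous results:\n{past or 'none yet'}\n\n"
        "Propose an improved harness."
    )
```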

Source: https://github.com/aradotso/trending-skills/tree/main/skills/meta-harness-optimization