ClawBench Powered by Nebius Token Factory

Pick the right model for your coding agent.

ClawBench benchmarks your agent across open-weight models on Nebius, scores every output, and exports a routing.md your coding agent reads natively to pick the best model per task.

  • BYO API key — stays in your browser
  • One-command local tunnel
  • Auto-generated routing rules

How it works

Three steps from a raw prompt to a production routing policy.

Step 1

Connect your agent

Paste a one-shot snippet into your coding agent. The snippet opens a public tunnel from your local agent to ClawBench, using your own Nebius API key.

Step 2

Run evals across models

Pick a task type and the Nebius models you want to compare. ClawBench runs the prompt through each model, judges every output, and ranks the models on quality, cost, and latency.
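To make the mechanics concrete, here is a minimal sketch of that comparison loop in Python. It assumes an OpenAI-compatible Nebius endpoint and a single LLM-as-judge score; the model IDs, base URL, judge prompt, and 0–10 scale are illustrative placeholders, not ClawBench's actual implementation.

```python
import time
from openai import OpenAI

# Assumed: an OpenAI-compatible Nebius endpoint; check the base URL for your account.
client = OpenAI(base_url="https://api.studio.nebius.ai/v1/", api_key="YOUR_NEBIUS_API_KEY")

CANDIDATES = ["deepseek-ai/DeepSeek-R1", "Qwen/Qwen2.5-Coder-32B-Instruct"]  # illustrative IDs
JUDGE = "meta-llama/Llama-3.3-70B-Instruct"                                   # illustrative judge

def run_eval(prompt: str) -> list[dict]:
    results = []
    for model in CANDIDATES:
        start = time.time()
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        latency = time.time() - start
        output = reply.choices[0].message.content

        # Toy judge: one 0-10 score; a real rubric scores several dimensions separately.
        verdict = client.chat.completions.create(
            model=JUDGE,
            messages=[{
                "role": "user",
                "content": (
                    "Rate this answer from 0 to 10 for correctness and completeness. "
                    f"Reply with a number only.\n\nTask: {prompt}\n\nAnswer: {output}"
                ),
            }],
        )
        score = float(verdict.choices[0].message.content.strip())
        results.append({"model": model, "score": score, "latency_s": round(latency, 2)})

    # Rank best score first; cost would join the ranking via per-token pricing.
    return sorted(results, key=lambda r: r["score"], reverse=True)
```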

Step 3

Export routing.md for your agent

ClawBench turns the winning models into a routing.md file: primary, fallback, and escalation rules per task type. Copy it into Cursor, Claude Code, or Lovable, and your agent routes every prompt to the right model.

The deliverable

One file. Drop it into any coding agent.

routing.md is plain markdown — a summary table, per-task rules, and a decision algorithm your agent can follow without any SDK, plugin, or custom integration.

  • Works with Cursor, Claude Code, Lovable, Continue, and any agent that reads context files.
  • Re-export anytime as your evals improve — version it in your repo.
  • Human-readable, so you can review and tweak before shipping.
# Routing Rules

## Summary
| Task type | Primary        | Fallback       |
| --------- | -------------- | -------------- |
| Debugging | DeepSeek R1    | Kimi K2.5      |
| Coding    | Qwen Coder 32B | Llama 3.3 70B  |
| Reasoning | DeepSeek R1    | Llama 3.1 405B |

## Decision algorithm
function route(taskType, prompt):
  rule = rules[taskType] or rules["coding"]
  response = call(rule.primary_model, prompt)
  if response.confidence < rule.confidence_threshold:
    response = call(rule.fallback_model, prompt)
  return response
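
If your harness executes the routing itself rather than leaving it to the agent, the pseudocode above translates into a few lines of real code. Here is a minimal Python sketch, assuming the rules from the summary table; the Response type, confidence signal, and call() stub are hypothetical stand-ins for whatever model client your agent already uses.

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    confidence: float  # hypothetical: however your agent estimates answer quality (0-1)

def call(model: str, prompt: str) -> Response:
    """Stand-in for your agent's model client; replace with a real API call."""
    raise NotImplementedError

# Rules mirroring the summary table above; thresholds are illustrative.
RULES = {
    "debugging": {"primary": "DeepSeek R1",    "fallback": "Kimi K2.5",      "threshold": 0.7},
    "coding":    {"primary": "Qwen Coder 32B", "fallback": "Llama 3.3 70B",  "threshold": 0.7},
    "reasoning": {"primary": "DeepSeek R1",    "fallback": "Llama 3.1 405B", "threshold": 0.7},
}

def route(task_type: str, prompt: str) -> Response:
    rule = RULES.get(task_type, RULES["coding"])      # unknown task types default to coding
    response = call(rule["primary"], prompt)
    if response.confidence < rule["threshold"]:       # escalate to the fallback on low confidence
        response = call(rule["fallback"], prompt)
    return response
```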

Stop guessing on model choice

Replace vibes with side-by-side scores on your real prompts — correctness, completeness, format reliability, and agent utility.

Your keys, your data

Your Nebius API key stays in your browser. Your agent runs locally. ClawBench only orchestrates the evals.

Cost-aware routing

Every recommendation is scored on cost-per-quality-point so you don't pay for a frontier model when a 70B will do.
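
The metric itself is plain arithmetic: what a run cost divided by the quality score it earned. A small sketch, with made-up per-token prices and judge scores:

```python
def cost_per_quality_point(prompt_tokens: int, completion_tokens: int,
                           price_in: float, price_out: float, quality: float) -> float:
    """Dollars spent per quality point. Prices are per 1M tokens; all numbers below are made up."""
    cost = prompt_tokens / 1e6 * price_in + completion_tokens / 1e6 * price_out
    return cost / quality

# A cheaper model can win even with a slightly lower quality score.
print(cost_per_quality_point(2_000, 1_500, price_in=0.13, price_out=0.40, quality=8.2))   # 70B-class
print(cost_per_quality_point(2_000, 1_500, price_in=3.00, price_out=15.00, quality=9.0))  # frontier-class
```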

Ready to route smarter?

Create a free account and run your first eval in minutes.