Powered by Nebius Token Factory
Pick the right model for your coding agent.
ClawBench benchmarks your agent across open-weight models on Nebius, scores every output, and exports a routing.md your coding agent reads natively to pick the best model per task.
How it works
Three steps from a raw prompt to a production routing policy.
Connect your agent
Paste a one-shot snippet into your coding agent. It opens a public tunnel from your local agent to ClawBench using your own Nebius API key.
Run evals across models
Pick a task type and the Nebius models you want to compare. ClawBench runs the prompt through each, judges the output, and ranks them on quality, cost, and latency.
Export routing.md for your agent
ClawBench turns the winning models into a routing.md file — primary, fallback, and escalation rules per task type. Copy it into Cursor, Claude Code, or Lovable and your agent routes every prompt to the right model.
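As an illustration, dropping the export into two common agents might look like the snippet below. The file names are those agents' own context-file conventions (`CLAUDE.md` for Claude Code, `.cursorrules` for Cursor), not something ClawBench prescribes; the `printf` line just creates a stand-in for the exported file so the example is self-contained.

```shell
# Hypothetical wiring -- adjust paths to your own project layout.
printf '# Routing Rules\n' > routing.md   # stand-in for the ClawBench export

cp routing.md CLAUDE.md          # Claude Code reads CLAUDE.md at the repo root
cat routing.md >> .cursorrules   # Cursor picks up .cursorrules as project rules
```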
One file. Drop it into any coding agent.
routing.md is plain markdown — a summary table, per-task rules, and a decision algorithm your agent can follow without any SDK, plugin, or custom integration.
- Works with Cursor, Claude Code, Lovable, Continue, and any agent that reads context files.
- Re-export anytime as your evals improve — version it in your repo.
- Human-readable, so you can review and tweak before shipping.
# Routing Rules
## Summary
| Task type | Primary        | Fallback       |
| --------- | -------------- | -------------- |
| Debugging | DeepSeek R1    | Kimi K2.5      |
| Coding    | Qwen Coder 32B | Llama 3.3 70B  |
| Reasoning | DeepSeek R1    | Llama 3.1 405B |
## Decision algorithm
function route(taskType, prompt):
    rule = rules[taskType] or rules["coding"]
    response = call(rule.primary_model, prompt)
    if response.confidence < rule.confidence_threshold:
        response = call(rule.fallback_model, prompt)
    return response

Stop guessing on model choice
Replace vibes with side-by-side scores on your real prompts — correctness, completeness, format reliability, and agent utility.
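A minimal sketch of how scores along those four dimensions could be folded into one quality number. The dimensions come from the text above; the weights and the 0–10 scale are illustrative assumptions, not ClawBench's actual scoring.

```python
# Illustrative only: aggregate per-dimension judge scores (0-10 each)
# into a single quality number. Weights are assumed, not ClawBench's.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "format_reliability": 0.20,
    "agent_utility": 0.15,
}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted average of judge scores, on the same 0-10 scale."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(quality_score({
    "correctness": 9.0,
    "completeness": 8.0,
    "format_reliability": 10.0,
    "agent_utility": 7.0,
}))  # ≈ 8.65
```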
Your keys, your data
Your Nebius API key stays in your browser. Your agent runs locally. ClawBench only orchestrates the evals.
Cost-aware routing
Every recommendation is scored on cost-per-quality-point so you don't pay for a frontier model when a 70B will do.
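One way to read "cost-per-quality-point" is dollars spent divided by quality earned, with a minimum quality bar so a cheap-but-weak model can't win on price alone. The sketch below assumes exactly that; the model names, prices, scores, and the 8.0 floor are all invented for the example.

```python
# Illustrative only: rank candidate models by cost per quality point,
# subject to a minimum quality floor. All numbers here are made up.
candidates = [
    # (model, quality score 0-10, cost per 1M tokens in USD)
    ("big-frontier-model", 9.2, 8.00),
    ("mid-size-70b", 8.8, 0.60),
    ("small-8b", 6.1, 0.10),
]

QUALITY_FLOOR = 8.0  # assumed minimum acceptable quality

# Keep only models that clear the bar, then pick the cheapest per point.
eligible = [m for m in candidates if m[1] >= QUALITY_FLOOR]
ranked = sorted(eligible, key=lambda m: m[2] / m[1])

print(ranked[0][0])  # -> mid-size-70b
```

The frontier model clears the quality bar but costs over ten times more per quality point, so the 70B-class model wins, which is the trade-off the recommendation is meant to surface.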