Open Benchmark

QuantitativeFinance-Bench

QFBench · Quantitative Finance Benchmark

The definitive benchmark for AI agents in quantitative finance. Hard problems. Real code. Verifiable outputs.

87 Tasks · 42 Models · 10,962 Runs · 61.7% Best Pass@1

What Makes QFBench Different

Not another QA benchmark — agents must think like quants

General Coding

vs HumanEval / MBPP

HumanEval tests algorithm logic with unit tests. QFBench requires domain knowledge: Black-Scholes, hazard rates, OU processes. The math must be right, not just the code structure.
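
To make that concrete, here is a minimal, generic Black-Scholes sketch in Python of the kind of formula many tasks build on. It is a textbook illustration only, not a QFBench reference solution, and the sample parameters are arbitrary.

import numpy as np
from scipy.stats import norm

def bs_call_price(S, K, T, r, sigma):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

# A QFBench-style verifier checks the number itself, not merely that the code runs
print(round(bs_call_price(S=100, K=100, T=1.0, r=0.05, sigma=0.2), 4))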

RAG / QA

vs FinanceBench (RAG)

The other FinanceBench asks questions about financial documents. Ours requires agents to write and execute quantitative code — no retrieval, no lookup, pure numerical implementation.

Terminal Ops

vs Terminal-Bench

Terminal-Bench evaluates CLI proficiency. QFBench evaluates whether agents can implement numerical methods correctly inside a Docker sandbox with Python financial libraries.

Quality Control

The Finance-Zero Rule

A non-agentic baseline: one LLM call, one script, one run. V11 tracks it separately so agentic CLI results are compared against a transparent single-shot baseline.

Leaderboard

Agent performance ranked by pass@1 across the 80 tasks with complete 3-run CLI coverage

#   Date         Model               Harness        pass@1   pass@3
1   2026-05-07   GPT-5.5             codex-cli      61.7%    66.2%
2   2026-05-07   claude-opus-4-7     claude-code    61.2%    67.1%
3   2026-05-07   GPT-5.3-codex       codex-cli      60.8%    67.5%
4   2026-05-07   claude-opus-4-6     claude-code    59.2%    65.0%
5   2026-05-07   GPT-5.4             codex-cli      57.5%    63.8%
6   2026-05-07   GPT-5.4-mini        codex-cli      57.1%    68.8%
7   2026-05-07   claude-sonnet-4-6   claude-code    56.3%    67.1%
8   2026-05-07   claude-sonnet-4-5   claude-code    46.2%    60.0%
9   2026-05-07   claude-haiku-4-5    claude-code    20.8%    31.2%

pass@1 comparison: bar chart of the same leaderboard pass@1/pass@3 results, 0–100% scale.

V11 score heatmap

Model × task score field

Each cell is the average score across three runs for one model on one task. CLI agent rows and Finance-Zero baseline rows are tracked separately.

Score scale: 0.00 to 1.00. Columns cover the 85 task IDs, from 13f-amendment-aware-crowding through zero-coupon-bootstrapping; rows are the CLI agents and Finance-Zero baselines. Per-model averages:

Model            Mode            Avg score
GPT-5.5          CLI             73%
GPT-5.3-codex    CLI             69%
Opus 4.6         CLI             68%
GPT-5.4          CLI             68%
GPT-5.4-mini     CLI             68%
Sonnet 4.5       CLI             54%
Haiku 4.5        CLI             29%
FZ Opus 4.6      Finance-Zero    28%
FZ Sonnet 4.5    Finance-Zero    17%
FZ Haiku 4.5     Finance-Zero    14%
7 CLI agents × 85 tasks × 3 runs
3 Finance-Zero baselines tracked separately
Source: V11-RESULTS.md · d2ad3a2 · 2026-05-04 UTC

V11 uses three independent runs per task. ERR runs caused by Docker/verifier failures are excluded from both numerator and denominator. The CLI comparison covers the 80 tasks where all seven models have complete three-run data.
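
For readers unfamiliar with the metrics, the sketch below shows one way to compute them under these rules. The run log is hypothetical and the exact V11 aggregation scripts may differ (for example, pass@1 could also be averaged per task before pooling).

# Hypothetical run log: task id -> outcomes of the three runs ("pass", "fail", or "ERR")
runs = {
    "task-a": ["pass", "fail", "pass"],
    "task-b": ["ERR", "pass", "pass"],   # the ERR run is excluded entirely
    "task-c": ["fail", "fail", "fail"],
}

def pass_at_1(runs: dict[str, list[str]]) -> float:
    """Share of valid (non-ERR) runs that pass, pooled across tasks."""
    valid = [o for outcomes in runs.values() for o in outcomes if o != "ERR"]
    return sum(o == "pass" for o in valid) / len(valid)

def pass_at_3(runs: dict[str, list[str]]) -> float:
    """Share of tasks solved by at least one valid run."""
    solved = [any(o == "pass" for o in outcomes if o != "ERR") for outcomes in runs.values()]
    return sum(solved) / len(solved)

print(f"pass@1 = {pass_at_1(runs):.1%}, pass@3 = {pass_at_3(runs):.1%}")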

Key Findings

Insights from the V11 three-run benchmark sweep

Winner

GPT-5.5 Leads V11

GPT-5.5 ranks first on pass@1 at 61.7% across the 80-task complete CLI comparison set. GPT-5.3-codex is close behind at 60.8%, with Opus 4.6 third at 59.2%.

Stability

pass@3 Shows Recovery Potential

GPT-5.4-mini posts the strongest pass@3 at 68.8%, showing that repeated attempts can recover many failures even when pass@1 trails the top models.

Baseline

Finance-Zero Remains Far Behind

The best non-agentic Finance-Zero baseline reaches 24.7% pass@1 across 83 valid tasks, well below the CLI-agent leaderboard and useful as a quality-control floor.

Task Catalog

The full task catalog is loaded from the main branch of the QFBench repository.

Run It Yourself

Evaluate any agent on QFBench using the Harbor framework

# 1. Install & build sandbox
pip install harbor

# Build sandbox base image (one-time, ~5 minutes)
docker build -t quantitativefinance-bench-sandbox:latest \
  -f docker/sandbox.Dockerfile .

# 2. Run an agent
export ANTHROPIC_API_KEY=<your-key>

# Run Claude Code on all calibration tasks
harbor run --path ./tasks \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-20250514

# Or target a single task
harbor run --path ./tasks \
  --task-name cds-pricing \
  --agent claude-code \
  --model anthropic/claude-haiku-3-5-20241022

# 3. Finance-Zero baseline (free with Gemini)
export GEMINI_API_KEY=<your-key>

harbor run --path ./tasks \
  --agent-import-path agents.finance_zero:FinanceZeroAgent \
  --model gemini/gemini-2.0-flash

Results are saved to jobs/<timestamp>/result.json. Each run creates agent trajectories, test output, and token usage logs.
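
If you want to collect results across a sweep, a minimal sketch like the one below is enough. It assumes only the jobs/<timestamp>/result.json layout mentioned above and makes no assumptions about the fields inside each file.

import json
from pathlib import Path

# Walk every run directory and load its result.json (field names are not assumed here)
for result_path in sorted(Path("jobs").glob("*/result.json")):
    with result_path.open() as f:
        result = json.load(f)
    top_level = sorted(result) if isinstance(result, dict) else type(result).__name__
    print(result_path.parent.name, top_level)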

How It Works

Rigorous evaluation powered by Harbor. Binary pass/fail scoring with strict numerical tolerances.

01

Task Specification

Each agent receives instruction.md with input data and evaluation criteria. No hints, no examples — just the problem.

02

Sandbox Execution

The agent writes and runs code inside a Docker sandbox with Python, NumPy, Pandas, TA-Lib pre-installed. Full iteration allowed.

03

Verification

Harbor runs pytest against the agent's output. Strict numerical tolerances. Pass or fail — no partial credit.
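
As an illustration of this scoring style, a verifier test can look like the sketch below; the output path, JSON key, expected value, and tolerance are hypothetical, not taken from a real QFBench task.

import json
import numpy as np

def test_call_price_within_tolerance():
    # Hypothetical output contract: the agent saves its answer to /app/output/result.json
    with open("/app/output/result.json") as f:
        result = json.load(f)
    # Strict numerical tolerance: the value matches the reference or the whole task fails
    assert np.isclose(result["call_price"], 10.4506, rtol=1e-4)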

04

Finance-Zero Baseline

A single-call non-agentic baseline: one LLM call, one script, one run. V11 reports it separately from CLI-agent results.
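
Conceptually the baseline reduces to something like the sketch below. This is not the actual FinanceZeroAgent implementation; call_llm is a stub standing in for a single request to the model under test.

import subprocess

def call_llm(prompt: str) -> str:
    """Stub for one request to the model under test (provider API intentionally not shown)."""
    raise NotImplementedError

def finance_zero(instruction: str) -> str:
    """One LLM call, one generated script, one execution: no iteration, no tool use."""
    script = call_llm(
        "Write a complete Python script that solves the task below. Return only code.\n\n"
        + instruction
    )
    with open("solution.py", "w") as f:
        f.write(script)
    # The generated script runs exactly once; whatever it produces is scored as-is
    return subprocess.run(["python", "solution.py"], capture_output=True, text=True).stdout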

What's Next

Maintaining the V11 leaderboard while the benchmark closes in on the 90-task target

  • Per-task public matrix (website): planned
  • Finance-Zero Baseline (single-call scripts): 87 tasks done
  • Full 90 Tasks (all models): 87/90 merged

Latest News

Latest updates from the QFBench project

N-003 · Leaderboard · 2026-05-04

V11 Leaderboard Published Across 80 Complete CLI Tasks

The homepage leaderboard now reflects V11 pass@1/pass@3 results: GPT-5.5 leads at 61.7% pass@1, followed by GPT-5.3-codex and Opus 4.6.

V11 ranks CLI agents by pass@1 across the 80 tasks where all seven evaluated models have complete three-run coverage.

N-002 · Dataset · 2026-05-04

QFBench Expands to 87 Merged Tasks

The benchmark repository has grown to 87 merged quantitative finance tasks, with the full 90-task milestone now in sight.

QFBench main now includes 87 merged tasks across derivatives pricing, risk, market microstructure, factor research, credit, crypto, and event-driven workflows.

N-001 · New · 2026-04-18

Weekly QFBench Discussion Is Open to Everyone

Join our weekly discussion to talk about benchmark progress, quantitative finance tasks, and upcoming evaluation updates.

We welcome everyone to join our weekly QFBench discussion.

Meeting link: https://meet.google.com/oyz-oyky-urc

Meeting time: Saturday 4:00 PM PST

Contributors

Thank you to everyone who has contributed to the benchmark or this website. Sourced from QFBench and finbench-website.


How to Contribute

QFBench is community-built. Every task on this leaderboard was contributed by the community — yours could be next.

What makes a good task?

QFBench tasks must require real quant expertise: numerical methods, dirty data, and verifiable outputs. Not trivia — real professional workflows that a senior quant would recognize. See the full guide for design principles and examples.

Task contribution guide →

Task requirements

  • Tasks must be easy to verify: an explicit output contract (what to produce and where to save it) that is programmatically checkable by code (e.g. np.isclose); see the sketch after this list.
  • instruction.md and task.toml must be written entirely by humans. instruction.md must not reference which skills to use — the agent must figure that out itself.
  • The reference solution must not be leaked via skills or the Dockerfile; no task-specific hints that give away the answer.
  • Oracle must pass 100%: the reference solution must pass all tests. Run harbor run --path ./tasks --task-name <task-id> --agent oracle and confirm every test passes before submitting.
  • Compare against Finance-Zero: run the single-shot baseline and report it separately from CLI-agent runs.
  • Deterministic: same input → same output; no external APIs at runtime.
  • Use real data, not synthetic — real data has missing values, outliers, mixed formats.
  • Tasks must represent realistic professional workflows without artificial difficulty: the problem itself should be fundamentally hard, not an ordinary problem made adversarial just to drive agent scores down.
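
As a hedged illustration of the output-contract and determinism requirements above, the tail of a reference solution script might look like the sketch below; the seed, file name, and toy numbers are hypothetical, and real tasks should operate on real input data as required.

import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed: repeated runs must produce identical output

# Toy computation standing in for the task's real numerical work on real data
returns = pd.Series(np.random.normal(0.0, 0.01, size=252))
var_95 = -float(np.quantile(returns, 0.05))

# Explicit output contract: a named file in a known location, checkable with np.isclose
pd.DataFrame({"metric": ["var_95"], "value": [var_95]}).to_csv("results.csv", index=False)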

Task format

Every task directory must include:

tasks/<task-id>/
+-- task.toml                                 # Metadata & resource limits
+-- instruction.md                            # Agent-facing problem statement
+-- environment/
|   +-- Dockerfile                            # Inherits from quantitativefinance-bench-sandbox
|   +-- data/                                 # Input datasets
|   \-- skills/                               # OPTIONAL — skills available to agent
|       \-- <skill-name>/
|           +-- SKILL.md
|           +-- scripts/                      # optional
|           \-- ...
+-- tests/
|   +-- test.sh                               # Harbor verifier entry-point
|   \-- test_outputs.py                       # Pytest assertions
\-- solution/
    \-- solve.sh                              # Reference (oracle) solution

Workflow

  1. Design the task and implement all required files (instruction, metadata, environment, tests, reference solution).
  2. Run harbor run --path ./tasks --task-name <task-id> --agent oracle — oracle must pass 100%.
  3. Run Finance-Zero baseline and report the result separately from agentic CLI attempts.
  4. Run at least two frontier agents from different companies (see the "Frontier (Strongest)" section in the model reference) and record results; include screenshots and a summary table in your PR.
  5. Open a PR with your task under tasks/.

FAQ

What kind of tasks are we looking for?

See the task design principles and difficulty guide in the task contribution guide.

How do I qualify for authorship?

Three high-quality tasks merged to main qualify you for automatic authorship. Your set must include at most one easy task and at least one hard task. Edge cases (e.g. two hard tasks) are reviewed on a case-by-case basis.

What if I contribute fewer tasks but help in other ways?

We count other contributions too: engineering (infrastructure, tooling, CI/CD), running experiments, and paper writing. We’re flexible — if you want to help, reach out.

Resources