Mar 27, 2026

Numerai Benchmark Models: The Bar You Need to Clear

How Numerai's benchmark_blender performs on MMC, where it ranks against staked models, and what it actually takes to beat it round after round.

Across 75 rounds of recent history, Numerai's benchmark_blender posts negative MMC in 49 of them, a 35% positive rate, with a lifetime sum of -0.18 MMC. Yet its round-by-round percentile rank still swings from the 7th to the 88th. "Beat the benchmark" is not a fixed bar; it's a moving regime indicator. If you can't beat it consistently, the issue may be that you're calibrating against a target that itself drifts about 80 percentile points over those 75 rounds.

This article looks at how benchmark_blender actually performs, where it ranks against the staked field, and whether beating it is getting harder. For background on the tournament mechanics, see How Numerai Works. To see where models currently stand, check the live leaderboard.

What Are Benchmark Models?

Benchmark models are Numerai's own submissions, trained on public data using documented methods. They provide a performance floor for new participants, feed into the meta-model's diversity, and set a transparent standard for what "good enough" looks like.

The benchmark_blender is the most-watched variant. It blends predictions from several example models into a single submission — competent but not exceptional, and designed to be beatable by anyone doing real feature engineering, target selection, or modeling work.

Cumulative Signal

Figure: Cumulative MMC over time for benchmark_blender and benchmark_models_te. benchmark_blender climbs to roughly +0.5 around round 770, then declines below zero by round 1240; benchmark_models_te briefly spikes to +0.3 in rounds 640-780.

Benchmark models do not stake NMR, so payout is not the right yardstick; the meaningful question is how much MMC they accumulate over time. Cumulative MMC for benchmark_blender peaks near +0.5 around round 770, then erodes through the 800-1240 stretch and ends slightly negative. benchmark_models_te appears only for a short window in the high 600s and low 700s, reaching +0.3 before its data run ends.
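The positive rate and cumulative MMC discussed here can be reproduced from any per-round score series. The following is a minimal sketch, not Numerai's own tooling: the function name `summarize_mmc` and the sample scores are hypothetical, and the positive-rate check assumes no round scored exactly zero.

```python
from itertools import accumulate


def summarize_mmc(round_mmc):
    """Return round count, positive rate, cumulative path, and final sum."""
    scores = list(round_mmc)
    if not scores:
        raise ValueError("need at least one scored round")
    positive = sum(1 for s in scores if s > 0)
    cumulative = list(accumulate(scores))  # running sum, one entry per round
    return {
        "rounds": len(scores),
        "positive_rate": positive / len(scores),
        "cumulative": cumulative,
        "final_sum": cumulative[-1],
    }


# Illustrative values only, NOT real tournament scores.
example_scores = [0.01, -0.02, 0.005, -0.01, -0.003]
summary = summarize_mmc(example_scores)

# The article's headline figures: 75 rounds, 49 of them negative.
# Assumes no round landed exactly on zero.
article_positive_rate = (75 - 49) / 75  # about 0.35
```

Feeding in a real score history (one MMC value per resolved round, in round order) gives the same three quantities the article reports: share of positive rounds, the cumulative curve, and the lifetime sum.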

The asymmetry is the story: a baseline that built up a +0.5 MMC cushion needed roughly 470 rounds to give it all back, and ended worse than where it started. The drawdown phase took longer than the build phase. Common-signal ensembles tend to win in low-information regimes (where averaging beats searching) and lose in regimes that reward differentiated residual signal — and Numerai has spent more rounds in the second category than the first.
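The build-versus-drawdown comparison amounts to finding the cumulative peak and the first later round at or below zero. A rough sketch of that calculation follows; `peak_and_giveback` is a hypothetical helper, and the sample curve uses only the two points the article states (about +0.5 near round 770, below zero by round 1240) with made-up values in between.

```python
def peak_and_giveback(rounds, cumulative):
    """Find the cumulative peak and the first later round at or below zero."""
    if len(rounds) != len(cumulative) or not rounds:
        raise ValueError("rounds and cumulative must be non-empty and aligned")
    peak_idx = max(range(len(cumulative)), key=cumulative.__getitem__)
    # First index after the peak where the whole cushion has been given back.
    giveback_idx = next(
        (i for i in range(peak_idx + 1, len(cumulative)) if cumulative[i] <= 0),
        None,
    )
    result = {
        "peak_round": rounds[peak_idx],
        "peak_value": cumulative[peak_idx],
        "giveback_round": None,
        "rounds_to_giveback": None,
    }
    if giveback_idx is not None:
        result["giveback_round"] = rounds[giveback_idx]
        result["rounds_to_giveback"] = rounds[giveback_idx] - rounds[peak_idx]
    return result


# Coarse, partly invented curve: only the 770 peak and the 1240 endpoint
# come from the article's description; the other points are placeholders.
sample_rounds = [700, 770, 900, 1100, 1240]
sample_cumulative = [0.2, 0.5, 0.3, 0.1, -0.01]
drawdown = peak_and_giveback(sample_rounds, sample_cumulative)
```

On this sample the gap between peak and give-back is 1240 - 770 = 470 rounds, which is the figure quoted above.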

Where Does the Benchmark Sit?

The benchmark's percentile rank tells you how hard it is to beat. A 30th-percentile benchmark is easy — most participants already clear it. A 70th-percentile benchmark means beating it requires outperforming the majority.

Figure: Benchmark_blender MMC percentile rank over rounds 1185-1235, ranging from roughly the 15th to the 85th percentile.

Benchmark_blender's MMC percentile is anything but stable. Across roughly the last 50 rounds, it has swung from the mid-teens to the mid-80s. It peaked near the 85th percentile around round 1210, a stretch where the ensemble was genuinely hard to beat, then collapsed into the 15-30 range by round 1230. See our MMC vs correlation primer if that metric is new to you.
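For readers who want to compute a comparable rank from their own data, here is one simple definition of percentile rank: the share of field scores strictly below the benchmark's score. This is an assumption for illustration; Numerai's leaderboard may handle ties or the comparison set differently, and `percentile_rank` is a hypothetical helper.

```python
def percentile_rank(benchmark_score, field_scores):
    """Percent of field scores strictly below the benchmark score (0-100)."""
    if not field_scores:
        raise ValueError("field_scores must not be empty")
    below = sum(1 for s in field_scores if s < benchmark_score)
    return 100.0 * below / len(field_scores)


# Illustrative single round, NOT real tournament data.
field = [-0.03, -0.01, 0.0, 0.01, 0.02]
rank = percentile_rank(0.005, field)  # 3 of 5 field scores are lower
```

Under this definition a rank of 30 means 70% of the field scored at or above the benchmark that round, which is why a low-percentile benchmark is easy to clear.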

"The benchmark" is not a fixed difficulty level. Whether your model beats it in any given week depends as much on regime as on skill.

Benchmark vs the Field

How does benchmark_blender stack up against the field median on raw MMC?

Figure: Benchmark_blender MMC versus field median MMC, both negative, with green and red shaded bands.

Both lines spend substantial time below zero in this window, with field-wide MMC running between about -0.02 and +0.01, though benchmark_blender briefly rises above the field median during the middle stretch. Benchmark and median track each other closely because both are exposed to the same data and market regime. Green bands mark rounds where the benchmark beat the median; red bands mark rounds where it fell short. Neither wins consistently, though the benchmark has skewed below the median in the most recent rounds.
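The green and red bands boil down to a per-round comparison against the field median. A minimal sketch under stated assumptions: the data layout (dicts keyed by round number), the `compare_to_median` name, and the "tie" label are all hypothetical, and the numbers are illustrative.

```python
from statistics import median


def compare_to_median(benchmark_mmc, field_mmc):
    """Label each round by whether the benchmark beat the field median.

    benchmark_mmc: dict of round number -> benchmark MMC
    field_mmc:     dict of round number -> list of field MMC scores
    """
    labels = {}
    for rnd in sorted(benchmark_mmc):
        scores = field_mmc.get(rnd)
        if not scores:
            continue  # skip rounds with no field data
        gap = benchmark_mmc[rnd] - median(scores)
        if gap > 0:
            labels[rnd] = "green"  # benchmark beat the median
        elif gap < 0:
            labels[rnd] = "red"    # benchmark underperformed
        else:
            labels[rnd] = "tie"
    return labels


# Illustrative values only, NOT real tournament scores.
benchmark_mmc = {1: -0.010, 2: 0.004, 3: -0.002}
field_mmc = {
    1: [-0.020, -0.005, 0.001],  # median -0.005
    2: [-0.003, 0.000, 0.002],   # median 0.000
    3: [-0.004, -0.002, 0.003],  # median -0.002
}
labels = compare_to_median(benchmark_mmc, field_mmc)
```

Note that a round can be labelled green even when both the benchmark and the median are negative, which matches the chart: beating the median is a relative result, not a sign of positive MMC.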

The tight coupling is expected. The benchmark trains on the same features everyone has, processed with standard methods, so it captures common signal without adding proprietary insight.

Can You Beat It?

What fraction of staked models actually beat the benchmark each round?

Figure: Percentage of staked models beating benchmark_blender on MMC, 10-round rolling average, ranging from 30% to 80%.

The 10-round rolling average ranges from about 30% to 80%. The benchmark was hardest to beat around round 1210, when only ~30% of staked models cleared it. By the most recent rounds, the share has climbed back above 75% — the field is comfortably outperforming benchmark_blender again.
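The rolling share can be sketched in two steps: compute the fraction of staked models that beat the benchmark in each round, then average over a trailing window. The helper names and sample data below are hypothetical, and the choice to emit values only once a full window is available is an assumption about how the chart was built.

```python
def share_beating(benchmark_score, staked_scores):
    """Fraction of staked models whose MMC is strictly above the benchmark's."""
    if not staked_scores:
        raise ValueError("staked_scores must not be empty")
    wins = sum(1 for s in staked_scores if s > benchmark_score)
    return wins / len(staked_scores)


def rolling_mean(values, window=10):
    """Trailing mean over full windows only; shorter prefixes are dropped."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Illustrative rounds as (benchmark MMC, staked model MMCs), NOT real data.
staked = [-0.01, 0.01, 0.02, -0.02]
sample = [(0.0, staked), (0.015, staked), (-0.015, staked)]
shares = [share_beating(bench, field) for bench, field in sample]
smoothed = rolling_mean(shares, window=3)  # window shortened for the tiny sample
```

With real data the default `window=10` matches the 10-round average described above; a wider window smooths regime swings further at the cost of reacting more slowly.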

Rounds where the benchmark is hardest to beat are rounds where its conservative ensemble lines up unusually well with the scoring regime. When that happens, a public baseline can punch above its normal weight. When the regime rewards differentiated residual signal, custom models pull ahead.

Takeaways

The benchmark is a moving target, not a fixed bar. Its percentile rank has swung from the 7th to the 88th over its 75 scored rounds, an 81-point range. Judge your model over at least 20-30 rounds, not one.

Benchmark performance is sticky across rounds. When benchmark_blender finishes below the field median for two rounds in a row, the historical probability of a third sub-median round is 77% (24 of 31 instances). Two sub-median rounds usually presage a third, which suggests streaks are driven by regime rather than by a sudden change in model quality. The same caution likely applies to your own model: do not panic-rebuild on a single hot or cold streak.
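The streak statistic is a conditional frequency: among all points where two consecutive rounds were sub-median, how often was the next round sub-median too. Below is a hedged sketch of one way to count it. The article does not say whether overlapping streaks were counted separately; this version assumes they were, and `third_after_two` is a hypothetical name.

```python
def third_after_two(below_median):
    """Count (continued, instances) for two-in-a-row sub-median streaks.

    below_median: list of booleans in round order, True when the benchmark
    finished below the field median. Overlapping windows are each counted,
    which is an assumption about the article's method.
    """
    instances = 0
    continued = 0
    for i in range(2, len(below_median)):
        if below_median[i - 2] and below_median[i - 1]:
            instances += 1
            if below_median[i]:
                continued += 1
    return continued, instances


# Illustrative sequence only, NOT real tournament history.
history = [True, True, True, False, True, True, False]
continued, instances = third_after_two(history)

# The article's reported counts: 24 continuations out of 31 instances.
article_rate = 24 / 31  # about 0.77
```

Applying the same function to your own model's below-median flags gives a quick check on whether its cold streaks persist at a similar rate.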

Use the benchmark for calibration, not competition. Marginally beating benchmark_blender means you've cleared a baseline that itself runs negative-MMC two-thirds of the time. The goal is unique signal that improves the meta-model, not edging out the floor.

There is still room above the benchmark. Across the most recent 15 rounds, between 28% and 79% of staked models beat benchmark_blender depending on the round — the share rotates with regime, and a median-of-field submission clears it more often than not.