Numerai Benchmark Models: The Bar You Need to Clear
How Numerai's benchmark_blender performs, where it ranks against staked models, and what beating it actually requires.
Numerai publishes benchmark models as public baselines anyone can examine, replicate, or try to beat. The most watched is benchmark_blender, an ensemble of Numerai's own example models that represents what you get from the official tutorial without original research. If your model cannot beat it, you are not adding signal.
This article looks at how the benchmark_blender actually performs, where it ranks against the staked field, and whether beating it is getting harder. For background on the tournament mechanics, see How Numerai Works. To see where models currently stand, check the live leaderboard.
What Are Benchmark Models?
Benchmark models are Numerai's own submissions, trained on public data using documented methods. They provide a performance floor for new participants, feed into the meta-model's diversity, and set a transparent standard for what "good enough" looks like.
The benchmark_blender is the most-watched variant. It blends predictions from several example models into a single submission — competent but not exceptional, and designed to be beatable by anyone doing real feature engineering, target selection, or modeling work.
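To make the blending idea concrete, here is a minimal sketch of one common ensembling approach — rank-averaging — with toy model names and weights. This is an illustration only; Numerai's actual blend weights and constituent models are not specified here.

```python
import numpy as np

def blend_predictions(model_preds):
    """Blend several models' predictions into one submission.

    Each model's raw predictions are converted to percentile ranks
    first, so models on different output scales contribute equally,
    then the ranks are averaged. Scheme and names are illustrative.
    """
    ranked = []
    for preds in model_preds.values():
        p = np.asarray(preds, dtype=float)
        # double argsort gives each value's rank; divide to get [0, 1]
        ranks = p.argsort().argsort() / (len(p) - 1)
        ranked.append(ranks)
    return np.mean(ranked, axis=0)

# Example: three toy "example models" scoring five stocks
preds = {
    "example_lgbm":  [0.2, 0.9, 0.4, 0.1, 0.7],
    "example_ridge": [10, 50, 30, 20, 40],  # different scale, same order idea
    "example_deep":  [0.5, 0.8, 0.3, 0.2, 0.6],
}
blend = blend_predictions(preds)
```

Rank-averaging is a deliberately conservative choice: it discards each model's confidence magnitudes and keeps only orderings, which is one reason a blend like this tends to be "competent but not exceptional."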
Cumulative Signal

Benchmark models do not stake NMR, so payout is not the right yardstick — the meaningful question is how much MMC they accumulate over time. Cumulative MMC for benchmark_blender peaks near +0.5 around round 770, then erodes through the 800-1240 stretch and ends slightly negative. benchmark_models_te appears only for a brief window in the high 600s and low 700s, reaching +0.3 before its data run ends.
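Cumulative MMC is nothing more exotic than a running sum of per-round MMC. A toy sketch with invented values (the real series comes from Numerai's round data, not shown here):

```python
import pandas as pd

# Per-round MMC for a hypothetical benchmark model, indexed by round.
# Values are invented to mirror the shape described above: early gains,
# then a slow giveback.
mmc = pd.Series([0.05, 0.10, -0.03, -0.08, -0.06],
                index=[760, 770, 900, 1100, 1240])

cumulative = mmc.cumsum()  # the "cumulative MMC" a chart would plot
```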
The takeaway: benchmarks contribute positive signal during favorable stretches and give it back when conditions turn. Over a full history they hover near break-even on MMC, which is exactly what a "competent baseline" should do.
Where Does the Benchmark Sit?
The benchmark's percentile rank tells you how hard it is to beat. A 30th-percentile benchmark is easy — most participants already clear it. A 70th-percentile benchmark means beating it requires outperforming the majority.
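For a single round, the percentile rank is just the share of the field the benchmark out-scores. A sketch, where the function name and inputs are illustrative rather than Numerai's API:

```python
import numpy as np

def percentile_rank(benchmark_mmc, field_mmc):
    """Percentile of the benchmark within one round's field (0-100).

    `field_mmc` is the array of staked models' MMC scores for the
    round. Returns the fraction of the field the benchmark beats,
    scaled to a percentile.
    """
    field = np.asarray(field_mmc, dtype=float)
    return 100.0 * np.mean(field < benchmark_mmc)

# A benchmark MMC of 0.01 against a toy five-model field
rank = percentile_rank(0.01, [-0.02, 0.0, 0.005, 0.02, 0.03])
```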

The benchmark_blender's MMC percentile is anything but stable. Across roughly the last 50 rounds, it has swung from the high teens to the mid-80s. It peaked near the 85th percentile around round 1210 — a stretch where the ensemble was genuinely hard to beat — then collapsed into the 15-30 range by round 1230. See our MMC vs correlation primer if that metric is new to you.
The practical lesson: "the benchmark" is not a fixed difficulty level. Whether your model beats it in any given week depends as much on regime as on skill.
Benchmark vs the Field
How does benchmark_blender stack up against the field median on raw MMC?

Both lines sit mostly in negative territory — field-wide MMC has been running between about -0.02 and +0.01 over this window. Benchmark and median track each other closely because both are exposed to the same data and market regime. Green bands mark rounds where the benchmark beat the median; red bands mark underperformance. Neither wins consistently, though the benchmark has skewed below the median in the most recent rounds.
The tight coupling is expected: the benchmark trains on the same features everyone has, processed with standard methods. It captures common signal without adding proprietary insight.
Can You Beat It?
What fraction of staked models actually beat the benchmark each round?

The 10-round rolling average ranges from about 30% to 80%. The benchmark was hardest to beat around round 1210, when only ~30% of staked models cleared it. By the most recent rounds, the share has climbed back above 75% — the field is comfortably outperforming benchmark_blender again.
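A rolling "share beating the benchmark" series like the one described above can be computed as follows. The rounds-by-models DataFrame layout and column names are assumptions for illustration, not Numerai's data schema:

```python
import pandas as pd

def share_beating_benchmark(per_round_mmc, benchmark_mmc, window=10):
    """Rolling share of staked models beating the benchmark.

    `per_round_mmc`: DataFrame indexed by round, one column per model.
    `benchmark_mmc`: Series of the benchmark's MMC per round.
    """
    beat = per_round_mmc.gt(benchmark_mmc, axis=0)     # True where model won
    share = beat.mean(axis=1)                          # per-round fraction
    return share.rolling(window, min_periods=1).mean() # smooth across rounds

# Toy data: two models over three rounds
rounds = [1200, 1201, 1202]
models = pd.DataFrame({"m1": [0.02, -0.01, 0.03],
                       "m2": [0.00, 0.01, 0.02]}, index=rounds)
bench = pd.Series([0.01, 0.00, 0.01], index=rounds)
rolling = share_beating_benchmark(models, bench, window=2)
```

The smoothing window matters: a 10-round average (as in the chart described above) trades responsiveness for stability, which is why single-round spikes do not show up in the 30%-80% range quoted.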
Rounds where the benchmark is hardest to beat tend to be those where standard approaches happen to outperform creative ones. When conditions reward the ensemble's conservative construction, it punches above its weight. When conditions reward originality, custom models pull ahead.
Takeaways
The benchmark is a moving target, not a fixed median. Over the last ~50 rounds, its percentile rank has swung from the teens to the 80s. Judge your model over many rounds, not one.
Benchmark performance tracks regime. Standard approaches work better in some conditions than others. Do not conclude your model is broken after a few bad rounds against the benchmark.
Use the benchmark for calibration, not competition. The goal is to generate unique signal that improves the meta-model, not to marginally edge out a public baseline. The benchmark is a floor, not a ceiling.
Original research still pays off. Numerai refines its examples over time, but the core challenge has not shifted enough to make the benchmark unreachable. In recent rounds, more than 75% of staked models are beating benchmark_blender on MMC.