Scoring Version Roulette: Winners and Losers When Numerai Changes the Rules
Across six versions of MMC scoring, rank correlation between consecutive versions averages 0.81 but drops as low as 0.68 at v3-to-v4, and only 38% of models maintain a cross-version percentile standard deviation below 5 points.
Numerai has iterated through six versions of MMC scoring, multiple CORJ60 revisions, and several FNC variants. Each version change reshuffles the leaderboard. Some models climb, others fall, and a few barely notice.
The submission_scores table stores all versions simultaneously, making direct comparison possible. This post quantifies disruption across each transition. For context on what these metrics measure, see MMC vs Correlation.
How Disruptive Is Each Version Change?
Take all models scored under two consecutive MMC versions in the same round, rank them by each version, and compute the Spearman rank correlation. A correlation of 1.0 means the version change reordered nobody.

The v1-to-v2 transition was gentle at 0.92, and v2-to-v3 held near that level. But v3-to-v4 drops to 0.68, the most disruptive single transition in MMC's history. The v4-to-v5 and v5-to-v6 transitions recover to 0.78 and 0.83 respectively. The cross-transition average sits at 0.81 -- enough agreement to preserve broad structure, but enough noise to reshuffle roughly 40% of the top quartile in a bad transition.
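The per-transition measurement above can be sketched in a few lines. This is an illustrative example on synthetic scores, not a query against the real submission_scores table; the column names `mmc_v3` and `mmc_v4` are assumed stand-ins for one round's scores under two consecutive versions.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in for one round of scores under two MMC versions.
rng = np.random.default_rng(0)
n_models = 500
mmc_v3 = rng.normal(0, 0.02, n_models)
# v4 partially agrees with v3, plus formula-specific noise.
mmc_v4 = 0.7 * mmc_v3 + 0.3 * rng.normal(0, 0.02, n_models)

# Spearman rank correlation: 1.0 would mean no model changed rank.
rho, _ = spearmanr(mmc_v3, mmc_v4)
print(f"v3-to-v4 Spearman rank correlation: {rho:.2f}")
```

On real data you would repeat this per round and average, which is how a single transition-level number like 0.68 is obtained.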
Who Gets Reshuffled?
The human story is in the percentile shifts: how many models gained or lost 10, 20, or 30 percentile points?

The distribution is roughly symmetric around zero -- version changes are not systematically biased. But the tails are heavy. In the v3-to-v4 transition, about 18% of models shifted by more than 15 percentile points in either direction -- enough to flip the sign of a payout. Smaller transitions like v1-to-v2 keep most models within a 5-point band.
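A minimal sketch of the percentile-shift calculation, again on synthetic data with assumed column names rather than the actual schema:

```python
import numpy as np
import pandas as pd

# Synthetic scores for one round; 'mmc_v3'/'mmc_v4' are assumed column names.
rng = np.random.default_rng(1)
n = 1000
v3 = rng.normal(size=n)
v4 = 0.7 * v3 + 0.7 * rng.normal(size=n)  # partially reshuffled
df = pd.DataFrame({"mmc_v3": v3, "mmc_v4": v4})

# Percentile rank (0-100) under each version, then the shift per model.
pct_v3 = df["mmc_v3"].rank(pct=True) * 100
pct_v4 = df["mmc_v4"].rank(pct=True) * 100
shift = pct_v4 - pct_v3

big_movers = (shift.abs() > 15).mean()
print(f"models shifting >15 percentile points: {big_movers:.0%}")
```

Because both percentile columns average to the same value by construction, the mean shift is zero; the interesting quantity is the fraction in the tails.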
Are Any Models Version-Proof?
For each model, we compute the standard deviation of its percentile rank across all available MMC versions within the same round. A low standard deviation means the model performs similarly regardless of which formula applies.

About 38% of models qualify as robust (standard deviation below 5 percentile points). These tend to occupy the broad middle of the leaderboard -- generating signal that multiple scoring formulas agree on.
At the other extreme, 14% are version-dependent (standard deviation above 15 points). Some were top-quartile under v3 and bottom-quartile under v4. If you are staking on a model, this distinction matters more than its current rank. The benchmark models tend to cluster in the robust zone -- designed for broad signal rather than formula-specific edge cases.
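The robustness metric can be sketched as follows. The version columns (`v3` through `v6`) and the noise model are hypothetical; the thresholds (5 and 15 percentile points) match the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-model percentile ranks under four MMC versions.
rng = np.random.default_rng(2)
n = 800
base = rng.uniform(0, 100, n)          # each model's "true" percentile
noise = rng.normal(0, 8, (n, 4))       # per-version perturbation
pcts = pd.DataFrame(
    np.clip(base[:, None] + noise, 0, 100),
    columns=["v3", "v4", "v5", "v6"],
)

# Cross-version standard deviation per model, then bucket by threshold.
cross_version_std = pcts.std(axis=1)
robust = (cross_version_std < 5).mean()
version_dependent = (cross_version_std > 15).mean()
print(f"robust: {robust:.0%}, version-dependent: {version_dependent:.0%}")
```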
How Far Has Canonical Scoring Drifted?
The canonical MMC value (canon_mmc) reflects whichever version Numerai currently considers authoritative. How much has it drifted from the original v1 formula?

In early rounds, the difference is near zero -- canon was v1. As Numerai iterated, the gap widened to roughly 0.012 per model in recent rounds, enough for meaningful percentile shifts near the median. The drift plateaus between version transitions and jumps when a new version becomes canonical.
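The drift series plotted above amounts to a per-round mean absolute gap between the canonical and v1 scores. A sketch with synthetic rounds (the round numbers and score magnitudes here are illustrative; only `canon_mmc` is a name from the source):

```python
import numpy as np
import pandas as pd

# Synthetic per-model scores across ten hypothetical rounds.
rng = np.random.default_rng(3)
rounds = np.repeat(np.arange(500, 510), 100)
mmc_v1 = rng.normal(0, 0.02, rounds.size)
canon = mmc_v1 + rng.normal(0, 0.012, rounds.size)  # drifted canonical score
df = pd.DataFrame({"round": rounds, "mmc_v1": mmc_v1, "canon_mmc": canon})

# Mean absolute per-model divergence from v1, per round.
drift = (df["canon_mmc"] - df["mmc_v1"]).abs().groupby(df["round"]).mean()
print(drift.round(4))
```

On real data, this series is flat at zero while canon equals v1, then steps upward each time a new version becomes canonical.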
This is why historical comparisons of "MMC over time" require specifying which version you mean -- visible in the trends dashboard and explored in the submission score journey.
Takeaways
Not all version changes are equal. The v3-to-v4 transition reshuffled rankings far more than any other, with a rank correlation of only 0.68. Participants who staked heavily based on v3 performance faced a rude surprise.
Most models are not version-proof. Only 38% maintain a cross-version percentile standard deviation below 5 points. If your model's rank swings wildly across versions, your edge may be an artifact of the scoring formula rather than genuine predictive power.
Canonical scoring has drifted meaningfully from v1. The average absolute divergence now sits around 0.012 per model. Historical performance comparisons require version-awareness, especially across the rounds list and medals thresholds.
Robustness is a signal. Models that score consistently across versions are likely capturing real market structure. Version-dependent models may be overfitting to the idiosyncrasies of a specific correlation computation -- a risk worth monitoring alongside the diversification dynamics of your portfolio.