n1c0.net

Author Nicola F.

PREDICTING PORTFOLIO PERFORMANCE

What 101,767 random portfolios taught me -- and why optimizing your portfolio on past data is mostly a self‑flattering exercise.

TL;DR

  1. Risk is predictable. Returns are not. Beta R² ≈ 0.67. Sharpe R² ≈ 0.02, sign negative.
  2. Don't chase past winners. Top‑Sharpe quintile mean‑reverts hard.
  3. Diversification works. Measuring it during a crisis does not. Simpson's paradox between composition and correlation metrics.
  4. 50–150 stocks is enough. Marginal gains flatten beyond ≈100, and the gains are lower volatility, not higher returns.
  5. ML doesn't rescue this. Random‑split looks brilliant, temporal split is worse than guessing the mean.

WHERE THIS STARTED

You see it everywhere. People ranking funds by Sharpe ratio, plotting efficient frontiers, picking the one shiny point in the top‑left corner, then expecting the future to look exactly like the backtest. It rarely does.

I bumped into this in a university competition once: take a benchmark, optimize on top of it, beat it on Sharpe. Everyone tuned their weights, everyone hit the frontier, everyone presented numbers that made the past look like a tailwind. The future had its own opinion.

Same story in conversations with friends in asset management, classmates, and the people I work with now. The portfolios that looked best in‑sample -- beautiful frontier point, double‑digit Sharpe -- were almost never the ones that delivered out‑of‑sample. They were fitted to a slice of history that had already moved on.

The way I think about it: it's like booking a restaurant on last year's reviews. The food was great when those reviews were written. The chef quit six months ago. The reviews are still up.

So the thesis question became simple:

Can ex‑ante portfolio characteristics actually predict ex‑post performance?

STEP ZERO: GETTING DATA THAT DIDN'T LIE

Before testing anything I needed data. Not just U.S., not just survivors, not just the slice that fits in a commercial subscription. Everything.

I tried EOD Historical Data first. Coverage had holes, corporate actions were inconsistent. After a few weeks patching their files I was spending more time fixing the source than analyzing it.

So I went back to something I already knew. A few years earlier I'd built a stack of scrapers for fun, monitoring impossible‑to‑buy items during lockdown. 300k+ requests a day, distributed proxies, the usual. Rusty but workable. I pointed it at financial data.

The funnel: 77,204 raw → 41,615 after ISIN dedup → 36,546 with 4+ years of history → 32,838 after density → 32,287 after liquidity.

32,287 securities · 78 countries · 143.4M daily price records · April 1964 → November 2025

Figure :: data funnel, from raw scrape to the final universe of 32,287 securities
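To see where the funnel above loses the most names, here is a small sketch that turns the quoted counts into retention rates. The counts come from the text; the stage labels and the `retention` helper are only illustrative.

```python
# Counts are the ones quoted in the funnel; stage labels are paraphrased.
FUNNEL = [
    ("raw scrape", 77_204),
    ("ISIN dedup", 41_615),
    ("4+ years of history", 36_546),
    ("density filter", 32_838),
    ("liquidity filter", 32_287),
]

def retention(funnel):
    """Return (stage, count, share kept vs previous stage, share kept vs raw)."""
    raw = funnel[0][1]
    prev = raw
    rows = []
    for name, count in funnel:
        rows.append((name, count, count / prev, count / raw))
        prev = count
    return rows

rows = retention(FUNNEL)
for name, count, step, overall in rows:
    print(f"{name:<22}{count:>8,}  step {step:6.1%}  overall {overall:6.1%}")
```

The ISIN dedup is by far the largest cut (about 54% of the raw scrape survives it), and roughly 42% of the raw rows reach the final universe.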

THE EXPERIMENT

Instead of cherry‑picking real funds, I generated random portfolios. 101,767 of them. Random size between 1 and 500 stocks, random weights, random formation dates from 1969 to 2023.

For each one I computed about 50 metrics on the 12 months before the formation date (ex‑ante, what an investor would have seen), and the same 50 metrics on the 12 months after (ex‑post, the truth). Then I checked how much one predicts the other.
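A minimal sketch of one draw of that experiment, under loud assumptions: the returns are synthetic monthly Gaussian noise rather than real prices, weights are normalized uniform draws, the universe has only 50 assets, and the risk-free rate is zero. All function names are hypothetical, and Sharpe stands in for the roughly 50 metrics.

```python
import random
import statistics

def random_portfolio(n_assets, rng, max_size=500):
    """Pick a random subset of assets with random weights that sum to 1."""
    size = rng.randint(1, min(max_size, n_assets))
    members = rng.sample(range(n_assets), size)
    raw = [rng.random() + 1e-9 for _ in members]
    total = sum(raw)
    return {asset: w / total for asset, w in zip(members, raw)}

def portfolio_returns(weights, returns, start, end):
    """Weighted portfolio return for each period in returns[start:end]."""
    return [sum(w * row[a] for a, w in weights.items()) for row in returns[start:end]]

def sharpe(series, periods_per_year=12):
    """Annualized mean over stdev, with the risk-free rate assumed to be zero."""
    sd = statistics.stdev(series)
    if sd == 0:
        return float("nan")
    return statistics.mean(series) / sd * periods_per_year ** 0.5

rng = random.Random(0)
n_assets, n_months = 50, 120
# Synthetic universe: rows are months, columns are assets.
returns = [[rng.gauss(0.006, 0.05) for _ in range(n_assets)] for _ in range(n_months)]

weights = random_portfolio(n_assets, rng)
t = rng.randrange(12, n_months - 12)  # formation month
ex_ante = sharpe(portfolio_returns(weights, returns, t - 12, t))  # what was visible
ex_post = sharpe(portfolio_returns(weights, returns, t, t + 12))  # what happened
print(len(weights), round(ex_ante, 2), round(ex_post, 2))
```

Repeating the draw many times and collecting the `(ex_ante, ex_post)` pairs gives the raw material for the predictability checks that follow.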

Random portfolios don't lie. No survivorship bias, no career concerns, no marketing department picking which strategy to publish.

Distribution of portfolio sizes
Figure :: portfolio size distribution
Distribution of formation dates
Figure :: formation date distribution

WHAT I FOUND


Metric                   R²      Correlation
Beta                     0.666   +0.82
Diversification Ratio    0.383   +0.62
VaR 95%                  0.185   +0.43
Avg. Correlation         0.157   +0.40
Volatility (annual)      0.098   +0.31
Sharpe ratio             0.024   −0.16
Alpha                    0.003   −0.06

Risk metrics on top. Return metrics on the bottom. That's the whole story.

Sharpe ratio is the one that surprises people. The sign is negative. Past Sharpe doesn't merely fail to predict future Sharpe: it anti‑predicts it. The portfolios that looked best last year tended to deliver the worst results the next year.
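For a one-variable linear fit, R² is simply the squared correlation, which is why a metric can have a small positive R² and a negative sign at the same time. The sketch below defines a hypothetical `persistence` helper and checks that the table's two numeric columns agree with each other to rounding.

```python
def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def persistence(ex_ante, ex_post):
    """Return (correlation, R²) of ex-post values regressed on ex-ante values."""
    r = pearson(ex_ante, ex_post)
    return r, r * r

# (metric, R², correlation) as quoted in the table above.
TABLE = [
    ("Beta", 0.666, 0.82),
    ("Diversification Ratio", 0.383, 0.62),
    ("VaR 95%", 0.185, 0.43),
    ("Avg. Correlation", 0.157, 0.40),
    ("Volatility (annual)", 0.098, 0.31),
    ("Sharpe ratio", 0.024, -0.16),
    ("Alpha", 0.003, -0.06),
]
max_gap = max(abs(corr ** 2 - r2) for _, r2, corr in TABLE)
print(f"largest |corr² - R²| in the table: {max_gap:.4f}")
```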

SHARPE MEAN‑REVERSION


Quintile (by ex‑ante Sharpe)   Ex‑ante   Ex‑post   Δ
Q1 (Low)                       −0.93     +1.48     +2.41
Q2                             +0.10     +1.05     +0.95
Q3                             +0.96     +1.06     +0.10
Q4                             +1.73     +0.96     −0.77
Q5 (High)                      +3.15     +0.79     −2.36

The bottom quintile gains 2.41. The top quintile loses 2.36. The "hot fund" of last year is the cold fund of this year, mechanically.

A way to picture it: if last summer was unusually hot, that doesn't mean this summer will be hotter -- it usually means closer to average. Markets do the same thing with returns. They don't with risk.
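A quintile table like the one above can be built in a few lines. The sketch below uses hypothetical data in which both years are independent noise around the same mean, so it reproduces only the pull toward the average, not the full reversal in ordering reported above. `quintile_table` is an illustrative name.

```python
import random
from statistics import mean

def quintile_table(ex_ante, ex_post, n_bins=5):
    """Sort by ex-ante value, cut into equal bins, average both sides per bin."""
    order = sorted(range(len(ex_ante)), key=lambda i: ex_ante[i])
    size = len(order) // n_bins
    rows = []
    for q in range(n_bins):
        # The last bin absorbs any remainder.
        idx = order[q * size:(q + 1) * size] if q < n_bins - 1 else order[q * size:]
        before = mean(ex_ante[i] for i in idx)
        after = mean(ex_post[i] for i in idx)
        rows.append((q + 1, before, after, after - before))
    return rows

rng = random.Random(1)
# Hypothetical Sharpe ratios: pure noise, no persistence at all.
ex_ante = [rng.gauss(1.0, 1.4) for _ in range(10_000)]
ex_post = [rng.gauss(1.0, 1.4) for _ in range(10_000)]

rows = quintile_table(ex_ante, ex_post)
for q, before, after, delta in rows:
    print(f"Q{q}  ex-ante {before:+.2f}  ex-post {after:+.2f}  delta {delta:+.2f}")
```

Even with zero skill in the data, the bottom bin "improves" and the top bin "deteriorates", which is the mechanical part of the effect.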

Figure :: Sharpe quintile mean reversion

RISK IS A DIFFERENT ANIMAL

Beta, by contrast, is almost embarrassingly predictable. R² of 0.67 across 54 years. And it gets more predictable during crises, not less.

Beta's R² jumps from 0.65 in normal times to 0.79 in crises. Volatility persistence barely moves (0.08 to 0.07). Sharpe and Alpha get more anti-predictive when markets break.

Think of beta as a stock's sensitivity to the market. In calm times that sensitivity is muddied by idiosyncratic noise. In a crisis the noise gets crushed, everyone moves together, and the sensitivity shows through cleanly.
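Beta in this sense is the regression slope cov(portfolio, market) / var(market). The sketch below is a toy simulation, not the thesis setup: it holds idiosyncratic noise fixed and raises market volatility in the "crisis" case, which is one simple way the signal can dominate the noise. The volatilities, the true beta of 1.3, and the helper names are all assumptions.

```python
import random
from statistics import stdev

def beta(portfolio, market):
    """OLS slope of portfolio returns on market returns: cov(p, m) / var(m)."""
    n = len(market)
    mp, mm = sum(portfolio) / n, sum(market) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(portfolio, market))
    var = sum((m - mm) ** 2 for m in market)
    return cov / var

def beta_spread(market_vol, noise_vol, true_beta=1.3, days=250, trials=200, seed=0):
    """Stdev of the estimated beta across simulated one-year samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        market = [rng.gauss(0, market_vol) for _ in range(days)]
        port = [true_beta * m + rng.gauss(0, noise_vol) for m in market]
        estimates.append(beta(port, market))
    return stdev(estimates)

calm = beta_spread(market_vol=0.01, noise_vol=0.01)
crisis = beta_spread(market_vol=0.03, noise_vol=0.01)
print(f"beta estimate spread, calm: {calm:.3f}  crisis: {crisis:.3f}")
```

The estimate is noticeably tighter when market moves are large relative to idiosyncratic noise, which fits the intuition in the paragraph above.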

Figure :: R² in normal vs crisis regimes

VOLATILITY PERSISTENCE

Volatility persistence decays with horizon, following:

R²(h) = 0.43 × exp(−h / 9.6), with h in months · half‑life = 9.6 × ln 2 ≈ 6.6 months

At h = 1 month, for instance, the formula gives R² ≈ 0.39: strong persistence.

Daily volatility is very persistent, monthly less so, annual barely at all. The clustering is real, it just dies out.
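The decay curve is easy to evaluate by hand. A small sketch, assuming the horizon h is measured in months (which is what the quoted half-life implies); the function name is illustrative.

```python
import math

A, TAU = 0.43, 9.6  # fitted values quoted above; TAU assumed to be in months

def r2_at_horizon(h_months):
    """Predicted volatility-persistence R² at a horizon of h_months."""
    return A * math.exp(-h_months / TAU)

# Half-life: the horizon at which R² has fallen to half its h = 0 value.
half_life = TAU * math.log(2)

for h in (1, 3, 6, 12, 24):
    print(f"h = {h:>2} months  R² = {r2_at_horizon(h):.2f}")
print(f"half-life = {half_life:.2f} months")
```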

Figure :: term structure of R² across horizons

THE MACHINE LEARNING SANITY CHECK

Before concluding any of this, I wanted to make sure I wasn't just missing non‑linear patterns that a smarter model would catch. I ran random forests and gradient boosting on the same data, two ways.

Random split (rows shuffled, train and test mixed across years):

Target         Linear R²   RF R²   XGB R²
Sharpe         0.09        0.57    0.60
Volatility     0.20        0.53    0.57
Beta           0.71        0.85    0.85
Max Drawdown   0.08        0.53    0.59

Looks great. Looks like ML cracked it.

Temporal split (train on pre‑2010, test on 2010+):

Target         Linear R²   RF R²
Sharpe         −0.05       −0.24
Volatility     −0.26       −1.18
Max Drawdown   −0.48       −3.65
Alpha          −0.08       −0.10

All negative. Worse than just guessing the mean. And the non‑linear models fail harder than the linear ones.
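A negative R² looks odd if you think of it as a squared correlation. Out of sample it is usually computed as 1 − SS_res / SS_tot, which goes below zero as soon as the model's errors are larger than those of a constant guess at the mean. A toy illustration with made-up numbers:

```python
def r_squared(actual, predicted):
    """Out-of-sample R²: 1 minus model squared error over mean-guess squared error."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

humble = r_squared(actual, [2.5, 2.5, 2.5, 2.5])     # always guess the mean
confident = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # confidently backwards

print(humble, confident)
```

Guessing the mean scores exactly 0; the confidently wrong model scores −3.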

That's the whole regime story in one table. Non‑linear patterns exist within a regime, but they don't carry across. The shuffled split lets the model peek at the future through the training set, the temporal split forbids it, and once you forbid it the gains vanish. A model trained on pre‑2010 markets has no idea what 2010+ markets are going to do, and confidently being wrong is worse than humbly guessing the average.

It's the same reason you can't train a doctor on patients from one era and expect them to recognize a new disease. The patterns generalize until they don't.
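The difference between the two evaluation schemes comes down to how rows are assigned to train and test. A minimal sketch with hypothetical rows carrying only a formation year; the 80/20 share and the function names are assumptions, and the 2010 cutoff is the one used above.

```python
import random

def random_split(rows, test_share=0.2, seed=0):
    """Shuffle rows, then cut: train and test end up mixed across years."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(rows, cutoff_year=2010):
    """Train strictly before the cutoff year, test on the cutoff year and later."""
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test

def years(rows):
    return {r["year"] for r in rows}

# Hypothetical portfolios: ten per formation year, 1969 to 2023.
rows = [{"year": y, "id": i} for y in range(1969, 2024) for i in range(10)]

rand_train, rand_test = random_split(rows)
time_train, time_test = temporal_split(rows)

print("shared years, random split:  ", len(years(rand_train) & years(rand_test)))
print("shared years, temporal split:", len(years(time_train) & years(time_test)))
```

In the random split nearly every test year also appears in training, so the model has already seen that regime. In the temporal split the overlap is zero by construction.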

Figure :: temporal vs random split, head to head

THE DIVERSIFICATION PARADOX

Everyone knows the idea: don't put all your eggs in one basket. The interesting question is how you measure how spread out the basket is.

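There are two families of metrics: composition metrics, which describe what is in the basket, and correlation metrics, which describe how the holdings move together.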

The catch: correlations are mechanically biased upward during high‑volatility periods (Forbes & Rigobon, 2002). Which means "diversification just broke down in the crisis" is partly a measurement artifact, not a structural change.
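The mechanical part of that bias shows up in a two-line model. Assume y = β·x + ε with x and ε independent: the link between the two assets (β) never changes, yet the measured correlation rises whenever the volatility of x rises. The numbers below are hypothetical, and this is only the textbook intuition, not the full Forbes and Rigobon adjustment.

```python
def implied_correlation(beta, sigma_x, sigma_eps):
    """Correlation of x and y when y = beta * x + eps, with x and eps independent."""
    return beta * sigma_x / (beta ** 2 * sigma_x ** 2 + sigma_eps ** 2) ** 0.5

# Same structural link (beta = 0.5) and same idiosyncratic noise in both regimes.
calm = implied_correlation(beta=0.5, sigma_x=0.01, sigma_eps=0.01)
crisis = implied_correlation(beta=0.5, sigma_x=0.03, sigma_eps=0.01)

print(f"calm: {calm:.2f}  crisis: {crisis:.2f}")
```

Tripling the volatility of x takes the correlation from about 0.45 to about 0.83 without any change in how the two assets are linked.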

Crisis       N        Composition R²   Correlation R²   Winner
Oil Crisis   3,461    0.20             0.36             Correlation
1987 Crash   745      0.00             0.93             Correlation
Dot‑Com      5,037    0.10             0.40             Correlation
GFC          2,769    0.07             0.41             Correlation
COVID‑19     507      0.05             0.78             Correlation
Pooled       13,249   0.26             0.09             Composition

Within every single crisis, correlation wins. Pool them together and composition wins. That's Simpson's paradox.

In plain terms: correlation metrics are like a thermometer that gives accurate readings inside one room but is calibrated differently in every room. If you stay in one room, trust it. If you walk between rooms, use the wall clock instead -- it doesn't change.
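Simpson's paradox is easier to believe with numbers in hand. The toy data below is invented for illustration: within each of two regimes the metric predicts the outcome perfectly, but the regimes sit at different levels, so the pooled fit is poor.

```python
def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Regime A: metric values 0..4, outcome equals the metric.
x_a, y_a = [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]
# Regime B: same slope, but the metric is shifted up and the outcome shifted down.
x_b, y_b = [5, 6, 7, 8, 9], [-4, -3, -2, -1, 0]

within_a = r_squared(x_a, y_a)
within_b = r_squared(x_b, y_b)
pooled = r_squared(x_a + x_b, y_a + y_b)

print(f"regime A: {within_a:.2f}  regime B: {within_b:.2f}  pooled: {pooled:.2f}")
```

A metric whose scale shifts between regimes can rank well inside each one and still fail once the regimes are pooled, which is the pattern in the table above.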

Figure :: composition advantage by crisis
Figure :: crisis-specific correlations

SO WHAT IS ACTUALLY WORTH DOING

  1. Risk is predictable. Returns are not. A beta of 1.3 today gives you 1.17 ± 0.26 a year out. A Sharpe of 2.0 today gives you 0.90 ± 2.9. One number is useful. The other is noise.
  2. Don't chase past winners at the portfolio level. The correlation is negative.
  3. Diversification works. Measuring it during a crisis does not. Use composition metrics for cross‑regime questions, correlation metrics for within‑regime ranking.
  4. 50–150 stocks captures most of the benefit. Beyond ≈100, marginal gains shrink fast. And the gains come from lower volatility, not higher returns.
  5. ML doesn't rescue any of this. It looks like it does, until you force the model to predict a regime it hasn't seen. Then it fails worse than a linear model.

Figure :: performance by portfolio size
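Point 4 in the list above has a textbook counterpart. For an equal-weighted portfolio of N stocks with identical volatility σ and identical pairwise correlation ρ, portfolio volatility is σ·sqrt(1/N + (1 − 1/N)·ρ). The sketch below uses hypothetical values (σ = 30%, ρ = 0.2), not estimates from the thesis data.

```python
def portfolio_vol(n, sigma=0.30, rho=0.2):
    """Volatility of an equal-weighted portfolio of n identical, equally correlated stocks."""
    return sigma * (1 / n + (1 - 1 / n) * rho) ** 0.5

floor = 0.30 * 0.2 ** 0.5  # limit as n grows: only the correlated part remains

for n in (1, 10, 50, 100, 150, 500):
    print(f"N = {n:>3}  vol = {portfolio_vol(n):.4f}")
print(f"floor    vol = {floor:.4f}")
```

Under these assumptions, going from 1 to 50 stocks cuts volatility by more than half, while going from 100 to 500 moves it by a fraction of a percentage point. Expected return does not depend on N when every stock has the same expected return, so all of the benefit is on the risk side.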

The example I keep coming back to: take any optimizer, feed it the last 12 months of returns, and let it spit out a portfolio. The output will look fantastic. The Sharpe in the report will be enormous. None of that tells you whether the portfolio is structurally good. It tells you the optimizer did its job on the data you gave it. Whether that data resembles the next 12 months is not the optimizer's problem.

Stop optimizing Sharpe. Start measuring risk. Diversify on composition, validate on correlation. Be suspicious of any backtest that's a little too pretty.


Still building. Still learning. Still here.

