n1c0.net/thesis

| Section | article |
|---|---|
| Reading time | ~10 min |
| Author | Nicola F. |
What 101,767 random portfolios taught me -- and why optimizing your portfolio on past data is mostly a self‑flattering exercise.
You see it everywhere. People ranking funds by Sharpe ratio, plotting efficient frontiers, picking the one shiny point in the top‑left corner, then expecting the future to look exactly like the backtest. It rarely does.
I bumped into this in a university competition once: take a benchmark, optimize on top of it, beat it on Sharpe. Everyone tuned their weights, everyone hit the frontier, everyone presented numbers that made the past look like a tailwind. The future had its own opinion.
Same story in conversations with friends in asset management, classmates, and the people I work with now. The portfolios that looked best in‑sample -- beautiful frontier point, double‑digit Sharpe -- were almost never the ones that delivered out‑of‑sample. They were fitted to a slice of history that had already moved on.
The way I think about it: it's like booking a restaurant on last year's reviews. The food was great when those reviews were written. The chef quit six months ago. The reviews are still up.
So the thesis question became simple:
Can ex‑ante portfolio characteristics actually predict ex‑post performance?
Before testing anything I needed data. Not just U.S., not just survivors, not just the slice that fits in a commercial subscription. Everything.
I tried EOD Historical Data first. Coverage had holes, corporate actions were inconsistent. After a few weeks patching their files I was spending more time fixing the source than analyzing it.
So I went back to something I already knew. A few years earlier I'd built a stack of scrapers for fun, monitoring impossible‑to‑buy items during lockdown. 300k+ requests a day, distributed proxies, the usual. Rusty but workable. I pointed it at financial data.
The funnel: 77,204 raw → 41,615 after ISIN dedup → 36,546 with 4+ years of history → 32,838 after density → 32,287 after liquidity.
32,287 securities · 78 countries · 143.4M daily price records · April 1964 → November 2025
Instead of cherry‑picking real funds, I generated random portfolios. 101,767 of them. Random size between 1 and 500 stocks, random weights, random formation dates from 1969 to 2023.
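A minimal sketch of what drawing one such portfolio could look like. The sampling details are assumptions, not the thesis code: sizes are drawn uniformly, weights are uniform draws normalized to sum to 1, the formation date is the first day of a random month, and the `SEC…` identifiers are placeholders.

```python
import random
from datetime import date

def random_portfolio(universe, rng, min_size=1, max_size=500,
                     first_year=1969, last_year=2023):
    """Draw one random portfolio: holdings, weights, and a formation date."""
    size = rng.randint(min_size, min(max_size, len(universe)))
    holdings = rng.sample(universe, size)            # no security picked twice
    # Assumption: uniform draws in (0, 1], normalized so weights sum to 1.
    raw = [1.0 - rng.random() for _ in holdings]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Assumption: formation on the first day of a random month.
    formation = date(rng.randint(first_year, last_year), rng.randint(1, 12), 1)
    return {"holdings": holdings, "weights": weights, "formation": formation}

rng = random.Random(42)                               # fixed seed for repeatability
universe = [f"SEC{i:05d}" for i in range(32_287)]     # placeholder identifiers
portfolio = random_portfolio(universe, rng)
```

Calling `random_portfolio` in a loop with the same `rng` would produce the full sample. Nothing in the draw depends on past performance, and that independence is what makes the sample useful.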
For each one I computed about 50 metrics on the 12 months before the formation date (ex‑ante, what an investor would have seen), and the same 50 metrics on the 12 months after (ex‑post, the truth). Then I checked how much one predicts the other.
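One simple way to quantify "how much one predicts the other" is the Pearson correlation between the ex‑ante and ex‑post value of a metric, and its square. For a one‑variable linear regression, R² equals the squared correlation. The sketch below assumes that univariate setup, and the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def persistence(ex_ante, ex_post):
    """Correlation and R² of a metric's ex-post value on its ex-ante value."""
    r = pearson(ex_ante, ex_post)
    return {"correlation": r, "r2": r * r}

# Toy values for one metric across four portfolios (not thesis data).
toy = persistence([0.8, 1.0, 1.2, 1.4], [0.9, 1.0, 1.3, 1.3])
```

The sign of the correlation matters as much as the R²: a metric can have a small R² and still point in the wrong direction.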
Random portfolios don't lie. No survivorship bias, no career concerns, no marketing department picking which strategy to publish.
| Metric | R² | Correlation |
|---|---|---|
| Beta | 0.666 | +0.82 |
| Diversification Ratio | 0.383 | +0.62 |
| VaR 95% | 0.185 | +0.43 |
| Avg. Correlation | 0.157 | +0.40 |
| Volatility (annual) | 0.098 | +0.31 |
| Sharpe ratio | 0.024 | −0.16 |
| Alpha | 0.003 | −0.06 |
Risk metrics on top. Return metrics on the bottom. That's the whole story.
Sharpe ratio is the one that surprises people. The sign is negative. Past Sharpe doesn't just fail to predict future Sharpe -- it anti‑predicts it. The portfolios that looked best last year systematically delivered the worst the next year.
| Quintile (by ex‑ante Sharpe) | Ex‑ante | Ex‑post | Δ |
|---|---|---|---|
| Q1 (Low) | −0.93 | +1.48 | +2.41 |
| Q2 | +0.10 | +1.05 | +0.95 |
| Q3 | +0.96 | +1.06 | +0.10 |
| Q4 | +1.73 | +0.96 | −0.77 |
| Q5 (High) | +3.15 | +0.79 | −2.36 |
The bottom quintile gains 2.41 Sharpe points. The top quintile loses 2.36. The "hot fund" of last year is the cold fund of this year, mechanically.
A way to picture it: if last summer was unusually hot, that doesn't mean this summer will be hotter -- it usually means closer to average. Markets do the same thing with returns. They don't do it with risk.
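The quintile sort itself is easy to sketch. The simulation below is an assumption‑laden toy, not the thesis data: every portfolio has the same true Sharpe of 1.0, and each year's measured Sharpe is that value plus independent noise. Sorting on the noisy ex‑ante number is enough to produce large positive deltas at the bottom and large negative ones at the top.

```python
import random

def quintile_table(ex_ante, ex_post, buckets=5):
    """Sort by ex-ante value, split into equal buckets, average both periods."""
    pairs = sorted(zip(ex_ante, ex_post))             # ascending ex-ante
    n = len(pairs)
    rows = []
    for q in range(buckets):
        chunk = pairs[q * n // buckets:(q + 1) * n // buckets]
        before = sum(a for a, _ in chunk) / len(chunk)
        after = sum(b for _, b in chunk) / len(chunk)
        rows.append((f"Q{q + 1}", before, after, after - before))
    return rows

# Toy data: identical true Sharpe, independent measurement noise each year.
rng = random.Random(0)
TRUE_SHARPE, NOISE = 1.0, 1.5                         # illustrative values
ex_ante = [TRUE_SHARPE + rng.gauss(0, NOISE) for _ in range(5000)]
ex_post = [TRUE_SHARPE + rng.gauss(0, NOISE) for _ in range(5000)]

for label, before, after, delta in quintile_table(ex_ante, ex_post):
    print(f"{label}: ex-ante {before:+.2f}  ex-post {after:+.2f}  delta {delta:+.2f}")
```

In this toy, the ex‑post column comes out roughly flat across quintiles, because noise alone only pulls everyone back to the mean. The table above goes further: ex‑post Sharpe actually declines from Q1 to Q5, which is why the correlation is negative rather than zero.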
Beta, by contrast, is almost embarrassingly predictable. R² of 0.67 across 54 years. And it gets more predictable during crises, not less.
Beta's R² jumps from 0.65 in normal times to 0.79 in crises. Volatility persistence barely moves (0.08 to 0.07). Sharpe and Alpha get more anti-predictive when markets break.
Think of beta as a stock's sensitivity to the market. In calm times that sensitivity is muddied by idiosyncratic noise. In a crisis the noise gets crushed, everyone moves together, and the sensitivity shows through cleanly.
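A rough way to see the signal‑versus‑noise argument: beta is the covariance of portfolio and market returns divided by the market variance, and the estimate gets tighter when market moves dominate idiosyncratic ones. In the sketch below a "crisis" is modeled simply as market volatility rising relative to idiosyncratic volatility. That choice, and every number in it, is an illustrative assumption.

```python
import random
import statistics

def beta(portfolio_returns, market_returns):
    """Beta = cov(portfolio, market) / var(market)."""
    n = len(market_returns)
    mean_p = sum(portfolio_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((p - mean_p) * (m - mean_m)
              for p, m in zip(portfolio_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var

def beta_spread(market_vol, idio_vol, true_beta=1.2, days=250, trials=200, seed=0):
    """Standard deviation of estimated betas across simulated one-year windows."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        market = [rng.gauss(0.0, market_vol) for _ in range(days)]
        portfolio = [true_beta * m + rng.gauss(0.0, idio_vol) for m in market]
        estimates.append(beta(portfolio, market))
    return statistics.stdev(estimates)

calm = beta_spread(market_vol=0.01, idio_vol=0.02)    # noise dominates
crisis = beta_spread(market_vol=0.04, idio_vol=0.02)  # market dominates
```

With the same true beta in both settings, the estimates scatter much less in the crisis case, which is the "sensitivity shows through cleanly" effect in miniature.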
Volatility persistence decays with horizon h (in months), following:
R²(h) = 0.43 × exp(−h / 9.6), which implies a half‑life of about 6.6 months.
At a horizon of about one month, the fit gives R² ≈ 0.39: strong persistence.
Daily volatility is very persistent, monthly less so, annual barely at all. The clustering is real, it just dies out.
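The arithmetic behind the decay curve, as a small sketch. It takes the fitted constants 0.43 and 9.6 from the formula above and assumes the horizon is measured in months, which is what the stated half‑life implies.

```python
import math

AMPLITUDE = 0.43    # fitted R² at horizon zero
TAU_MONTHS = 9.6    # fitted decay constant, in months

def predicted_r2(horizon_months):
    """R²(h) = 0.43 * exp(-h / 9.6)."""
    return AMPLITUDE * math.exp(-horizon_months / TAU_MONTHS)

def half_life(tau=TAU_MONTHS):
    """Horizon at which R² has halved: tau * ln(2)."""
    return tau * math.log(2)

for h in (1, 3, 6, 12, 24):
    print(f"h = {h:>2} months -> R² ≈ {predicted_r2(h):.2f}")
print(f"half-life ≈ {half_life():.1f} months")
```

The half‑life follows directly from the exponential form: setting exp(−h / 9.6) = 1/2 gives h = 9.6 × ln 2.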
Before concluding any of this, I wanted to make sure I wasn't just missing non‑linear patterns that a smarter model would catch. I ran random forests and gradient boosting on the same data, two ways.
Random split (rows shuffled, train and test mixed across years):
| Target | Linear R² | RF R² | XGB R² |
|---|---|---|---|
| Sharpe | 0.09 | 0.57 | 0.60 |
| Volatility | 0.20 | 0.53 | 0.57 |
| Beta | 0.71 | 0.85 | 0.85 |
| Max Drawdown | 0.08 | 0.53 | 0.59 |
Looks great. Looks like ML cracked it.
Temporal split (train on pre‑2010, test on 2010+):
| Target | Linear R² | RF R² |
|---|---|---|
| Sharpe | −0.05 | −0.24 |
| Volatility | −0.26 | −1.18 |
| Max Drawdown | −0.48 | −3.65 |
| Alpha | −0.08 | −0.10 |
All negative. Worse than just guessing the mean. And the non‑linear models fail harder than the linear ones.
That's the whole regime story in one table. Non‑linear patterns exist within a regime, but they don't carry across. The shuffled split lets the model peek at the future through the training set. The temporal split forbids it, and once you forbid it the gains vanish. A model trained on pre‑2010 markets has no idea what 2010+ markets are going to do, and confidently being wrong is worse than humbly guessing the average.
It's the same reason you can't train a doctor on patients from one era and expect them to recognize a new disease. The patterns generalize until they don't.
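Two mechanics from the tables above are worth making concrete: how the two splits differ, and how an R² can go negative. The sketch below uses placeholder rows with a `year` field and one common definition of out‑of‑sample R² (1 − SSE/SST around the test‑set mean). Both are assumptions for illustration, not the thesis pipeline.

```python
import random

def r2_out_of_sample(y_true, y_pred):
    """1 - SSE/SST. Negative when predictions are worse than the test mean."""
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1 - sse / sst

def random_split(rows, test_fraction, rng):
    """Shuffle rows, then cut: train and test share the same years."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(rows, cutoff_year):
    """Train strictly before the cutoff, test from the cutoff onward."""
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test

# Placeholder rows: 20 portfolios per formation year.
rows = [{"year": y, "id": i} for y in range(1969, 2024) for i in range(20)]
rand_train, rand_test = random_split(rows, 0.2, random.Random(0))
time_train, time_test = temporal_split(rows, 2010)

shared_random = {r["year"] for r in rand_train} & {r["year"] for r in rand_test}
shared_temporal = {r["year"] for r in time_train} & {r["year"] for r in time_test}
```

`shared_random` contains formation years that appear on both sides of the shuffled split, so the model has already seen each test regime during training. `shared_temporal` is empty. And `r2_out_of_sample([1, 2, 3], [3, 2, 1])` is negative, because those predictions miss by more than the plain average would.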
Everyone knows the idea: don't put all your eggs in one basket. The interesting question is how you measure how spread out the basket is.
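There are two families of metrics: composition‑based ones, which look at what the portfolio holds and how the weights are spread, and correlation‑based ones, which look at how the holdings move together.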
The catch: correlations are mechanically biased upward during high‑volatility periods (Forbes & Rigobon, 2002). Which means "diversification just broke down in the crisis" is partly a measurement artifact, not a structural change.
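The size of that bias can be sketched with the adjustment usually attributed to Forbes & Rigobon (2002): divide the observed crisis correlation by sqrt(1 + δ(1 − ρ²)), where δ is the relative increase in variance versus the calm period. The function and parameter names below are illustrative, and the formula is written here as commonly stated rather than taken from the thesis.

```python
import math

def adjusted_correlation(observed_rho, calm_variance, crisis_variance):
    """Correct a crisis-period correlation for the rise in variance.

    delta is the relative increase in variance from calm to crisis.
    """
    delta = crisis_variance / calm_variance - 1.0
    return observed_rho / math.sqrt(1.0 + delta * (1.0 - observed_rho ** 2))

# Illustrative numbers: variance quadruples, observed correlation is 0.80.
example = adjusted_correlation(0.80, calm_variance=1.0, crisis_variance=4.0)
```

With those made‑up inputs the adjusted value comes out near 0.55, well below the observed 0.80. Part of the apparent jump in correlation is the higher variance talking, not a change in how the assets are linked.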
| Crisis | N | Composition R² | Correlation R² | Winner |
|---|---|---|---|---|
| Oil Crisis | 3,461 | 0.20 | 0.36 | Correlation |
| 1987 Crash | 745 | 0.00 | 0.93 | Correlation |
| Dot‑Com | 5,037 | 0.10 | 0.40 | Correlation |
| GFC | 2,769 | 0.07 | 0.41 | Correlation |
| COVID‑19 | 507 | 0.05 | 0.78 | Correlation |
| Pooled | 13,249 | 0.26 | 0.09 | Composition |
Within every single crisis, correlation wins. Pool them together and composition wins. That's Simpson's paradox.
In plain terms: correlation metrics are like a thermometer that gives accurate readings inside one room but is calibrated differently in every room. If you stay in one room, trust it. If you walk between rooms, use the wall clock instead -- it doesn't change.
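The flip between "within" and "pooled" is easy to reproduce on synthetic data. In the toy below (made‑up numbers, not the thesis sample) the outcome depends strongly on a correlation‑style metric whose sign differs by regime, and weakly but consistently on a composition‑style metric.

```python
def r2(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# A 3x3 grid keeps the two metrics uncorrelated with each other.
GRID = [(c, z) for c in (-1, 0, 1) for z in (-1, 0, 1)]

def regime(sign):
    """One regime: strong link to c (sign varies), weak stable link to z."""
    corr_metric = [c for c, _ in GRID]
    comp_metric = [z for _, z in GRID]
    outcome = [sign * c + 0.5 * z for c, z in GRID]
    return corr_metric, comp_metric, outcome

regimes = [regime(+1), regime(-1)]
within = [(r2(c, y), r2(z, y)) for c, z, y in regimes]

pooled_c = [v for c, _, _ in regimes for v in c]
pooled_z = [v for _, z, _ in regimes for v in z]
pooled_y = [v for _, _, y in regimes for v in y]
pooled = (r2(pooled_c, pooled_y), r2(pooled_z, pooled_y))
```

Inside each regime the correlation‑style metric wins (R² of 0.80 against 0.20). Pooled, its two opposite relationships cancel to an R² of 0.00, while the composition‑style metric keeps its 0.20. Same thermometer, different rooms.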
The example I keep coming back to: take any optimizer, feed it the last 12 months of returns, and let it spit out a portfolio. The output will look fantastic. The Sharpe in the report will be enormous. None of that tells you whether the portfolio is structurally good. It tells you the optimizer did its job on the data you gave it. Whether that data resembles the next 12 months is not the optimizer's problem.
Stop optimizing Sharpe. Start measuring risk. Diversify on composition, validate on correlation. Be suspicious of any backtest that's a little too pretty.
Still building. Still learning. Still here.