Cross-section shrinkage lab (ETF panel)

Interactive diagnostics on a multi-ETF Yahoo Finance sample: principal components of the covariance of excess returns, cross-sectional R² versus model size, pseudo out-of-sample folds, and ridge-style attenuation. All copy and code here are written for this site.

What this page is for

This page is a self-contained quantitative lab: it explains how we summarize many correlated return series with a few directions, how cross-sectional fit behaves as you add those directions, and how a simple time-split check changes the picture versus in-sample numbers.

The charts read from JSON produced on your machine or in CI; they are not hand-drawn placeholders. Interpret results as engineering intuition on liquid ETFs, not as a substitute for peer-reviewed empirical design.

Why look at many sleeves at once?

When you expand beyond a single broad equity index, you introduce regions, factors, credit, commodities, and defensive sleeves that move on partly shared shocks.

A wide panel makes collinearity routine. When you ask a model to explain the dispersion of average returns across names, estimation can chase noise unless rotations and penalties keep the solution stable.

Method intuition

We start from a rectangle of daily excess returns. Its covariance matrix encodes which combinations of assets wiggle together; eigenvectors rank those combinations by variance.

Projecting average returns onto the leading eigenvectors is a linear summary—useful for visualization and for spotting when extra directions stop helping out of sample.

Ridge attenuation on PC coefficients is one transparent way to show how a penalty shrinks tail directions more than the first few.

Linear algebra used here

Let R be a T×N matrix of synchronized daily excess returns. Column means give the average risk premium vector we try to describe. Column-centered R feeds a sample covariance S.

Eigenvectors of S are directions in asset space; the first few capture most shared movement. Regressing the mean return vector on the first K eigenvectors is a compact linear benchmark.

Increasing K never lowers in-sample fit when nothing is penalized; the interesting question is whether a held-out calendar slice still looks structured when you refit directions on past data only.
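The in-sample benchmark described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's actual job: the function name `in_sample_cs_r2` is hypothetical, and it assumes a no-intercept regression with an uncentered R² (one minus residual sum of squares over the sum of squared mean returns), which the text does not specify.

```python
import numpy as np

def in_sample_cs_r2(returns: np.ndarray, max_k: int) -> list[float]:
    """Cross-sectional R² of mean returns on the first K eigenvectors of S, K = 1..max_k.

    Assumptions (not from the page): no intercept, uncentered R².
    """
    mu = returns.mean(axis=0)                      # N-vector of average excess returns
    centered = returns - mu                        # column-centered R
    cov = centered.T @ centered / (returns.shape[0] - 1)  # sample covariance S
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]  # reorder to descending variance
    total = float(mu @ mu)
    r2 = []
    for k in range(1, max_k + 1):
        V = eigvecs[:, :k]
        gamma = V.T @ mu                           # OLS coefficients (columns of V are orthonormal)
        resid = mu - V @ gamma
        r2.append(1.0 - float(resid @ resid) / total)
    return r2
```

Because the eigenvectors form an orthonormal basis, each extra direction can only remove residual, so under these assumptions the curve is non-decreasing and reaches 1 at K = N.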

Cross-sectional R² and why two curves disagree

In-sample cross-sectional R² compares fitted mean returns to realized column means using the same window that built S and the regression—flexibility usually lifts this curve.

Pseudo out-of-sample R² rebuilds eigenvectors on years up to Y, estimates loadings on past mean returns only, then scores how those loadings line up with the next year’s realized mean vector.

The dash-dot curve applies a small finite-sample scaling to the same fold-wise OOS values, as a reminder that raw ratios can look optimistic when K grows relative to a finite training length.

Interactive charts (Yahoo Finance ETF lab)

Figures read `data/shrinking-cross-section-study/concept_charts.json`. Regenerate with `npm run data:shrinking-cross-section` after `pip install -r engine/requirements.txt`. The job downloads adjusted closes, subtracts a short-rate proxy from ^IRX, aligns twenty ETFs, and recomputes PCA, regressions, folds, and ridge summaries.

If the static page was built without the file, the charts also try to fetch `/data/shrinking-cross-section-study/concept_charts.json` in the browser. Commit the JSON so production hosts always embed or serve it.
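As a rough sketch of the excess-return step in the regeneration job, the snippet below turns adjusted closes into daily simple returns and subtracts a daily short rate. The function name, the 252-day convention, and the treatment of ^IRX as an annualized percent yield are assumptions made for illustration; the repository's job may convert the rate differently.

```python
import pandas as pd

def daily_excess_returns(
    adj_close: pd.DataFrame,
    irx_percent: pd.Series,
    days_per_year: int = 252,  # assumed day-count convention
) -> pd.DataFrame:
    """Daily simple returns minus a daily short-rate proxy.

    Assumes irx_percent is an annualized yield quoted in percent.
    """
    simple = adj_close.pct_change(fill_method=None)
    # Align the rate to the price calendar and carry the last quote forward
    rf_daily = irx_percent.reindex(adj_close.index).ffill() / 100.0 / days_per_year
    excess = simple.sub(rf_daily, axis=0)
    return excess.dropna(how="any")               # drops the first row and any gaps
```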

Live Yahoo Finance panel. Daily adjusted closes from Yahoo Finance; excess returns use ^IRX as a simple short-rate proxy. Metrics are computed in this repository for a liquid ETF sleeve and are intended as a transparent lab snapshot.

Updated: 2026-04-20T05:27:15Z
Window: 2013-07-19 to 2026-04-17
Days × assets: 3205 × 20
CV fold chart uses K = 10 PCs
Panel: SPY, QQQ, IWM, VTV, VUG, MTUM, QUAL, USMV, EFA, EEM, VEA, TLT, LQD, HYG, VNQ, GLD, XLE, XLF, XLK, XLV

What this shows: Model-complexity trade-off on the ETF panel: in-sample cross-sectional R² vs K principal directions, mean pseudo-OOS R² across one-year-ahead folds, and a finite-sample-shrunk OOS curve.

How to read it: K is the number of principal-component columns. The in-sample curve takes them from the full-sample covariance; the OOS curves refit eigenvectors in each fold on training data only, then compare the fit from the training-window coefficients (gammas) with the next calendar year’s realized mean excess returns.

How folds are built

Each fold uses all trading days through calendar year Y for training, then evaluates on year Y+1 only. Eigenvectors and regression coefficients reset every fold so the test year never enters the rotation matrix.

If a year is missing from the vendor feed, the fold is skipped; expect fewer bars in the fold chart when listings are younger.
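The expanding-window fold logic above might look like the following sketch. The names `top_k_eigvecs`, `pseudo_oos_r2`, and the `min_train_years` parameter are hypothetical, and the score is assumed to be an uncentered R² of the training-fitted mean vector against the test year's realized means; the page does not state the exact scoring formula.

```python
import numpy as np
import pandas as pd

def top_k_eigvecs(block: np.ndarray, k: int) -> np.ndarray:
    """Leading k eigenvectors of the sample covariance of a T×N block."""
    centered = block - block.mean(axis=0)
    cov = centered.T @ centered / (block.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]

def pseudo_oos_r2(returns: pd.DataFrame, k: int, min_train_years: int = 2) -> dict[int, float]:
    """Train on all days through year Y, score on year Y+1 only.

    `returns` needs a DatetimeIndex. Scoring rule is an assumption (uncentered R²).
    """
    years = sorted(int(y) for y in returns.index.year.unique())
    scores: dict[int, float] = {}
    for y in years[min_train_years - 1:]:
        train = returns[returns.index.year <= y]
        test = returns[returns.index.year == y + 1]
        if test.empty:                            # missing year: skip the fold
            continue
        V = top_k_eigvecs(train.to_numpy(), k)    # rotation uses training days only
        gamma = V.T @ train.mean().to_numpy()     # coefficients from past means only
        realized = test.mean().to_numpy()
        resid = realized - V @ gamma
        scores[y + 1] = 1.0 - float(resid @ resid) / float(realized @ realized)
    return scores
```

Keying the result by test year makes it easy to see which folds were skipped. Unlike the in-sample curve, these values are not bounded below by zero.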

Data hygiene reminders

Inner-join every series on dates before taking means; document how you treat stale symbols or thin ETFs.
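One way to do the inner join with pandas is sketched below; `align_panel` and the symbol keys are illustrative names, not part of the repository.

```python
import pandas as pd

def align_panel(series_by_symbol: dict[str, pd.Series]) -> pd.DataFrame:
    """Keep only dates present in every series, one column per symbol."""
    panel = pd.concat(series_by_symbol, axis=1, join="inner")
    # Drop rows where a symbol has a date but no value, then order by date
    return panel.dropna(how="any").sort_index()
```

Taking means only after this step keeps every column average on the same set of trading days.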

Keep vendor quirks explicit: corporate-action adjustments differ by endpoint; treat numbers as reproducible snapshots, not audit-grade accounting.

Reading ridge and DOF panels

The ridge panel contrasts unpenalized PC coefficients with a single λ chosen relative to the average eigenvalue on the leading block—purely illustrative, not a tuned hyperparameter search.

The effective-degrees-of-freedom trace shows how many singular directions stay active as λ grows; it is a diagnostic, not a trading signal.
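For intuition, the sketch below uses the textbook ridge shrink factor e_j / (e_j + λ) for a direction with eigenvalue e_j, and sums those factors to get effective degrees of freedom. Treat the functional form and the function names as assumptions for illustration; the page does not spell out the exact formula behind the panels.

```python
import numpy as np

def ridge_attenuation(eigvals: np.ndarray, lam: float) -> np.ndarray:
    """Shrink factor per principal direction: near 1 for large eigenvalues, near 0 for small."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals / (eigvals + lam)

def effective_dof(eigvals: np.ndarray, lam: float) -> float:
    """Sum of shrink factors; equals the number of directions when lam = 0."""
    return float(ridge_attenuation(eigvals, lam).sum())

# Illustrative eigenvalues only, with lam set to the average of the leading block
eigvals = np.array([4.0, 1.0, 0.25])
lam = float(eigvals[:2].mean())
factors = ridge_attenuation(eigvals, lam)
```

With a penalty tied to the average leading eigenvalue, the first direction keeps most of its coefficient while tail directions are cut sharply, which is the pattern the ridge panel is meant to show.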

Reproducibility

Run `npm run data:shrinking-cross-section` from the repo root. Commit `data/shrinking-cross-section-study/concept_charts.json` so teammates and static hosts receive the same snapshot.

`npm run build` copies that file into `public/data/...` via `scripts/ensure-shrinking-cross-section-public.mjs`.

Limitations

Yahoo Finance data can be revised, split-adjusted, and occasionally missing; ^IRX is only a rough daily risk-free proxy.

One-year mean-return blocks are coarse; they summarize macro years, not monthly rebalanced academic portfolios.

Any investment use requires your own data hygiene, transaction-cost and leverage assumptions, and regulatory review.