Hierarchical PCA and Modeling Asset Correlations

Keywords: hierarchical PCA, asset correlations, sector-based HPCA, statistical clustering, K-means, dynamic blocks, PCA vs HPCA spectrum, cluster portfolios, Avellaneda and Serur (2020)

Abstract

This study implements Hierarchical Principal Component Analysis (HPCA) for modeling cross-sectional correlations among US large-cap equities, following Avellaneda and Serur (2020). Classical PCA extracts eigenportfolios from the full empirical correlation matrix, but higher-order components are often unstable and economically opaque — the *identification problem*. HPCA imposes a tree-like block structure: assets are grouped into sectors or data-driven clusters, each with its own first eigenportfolio factor, while cross-block residual correlations are set to zero. We compare three block definitions — fixed GICS sectors, statistical sign-pattern clusters from PCA eigenvectors, and K-means partitions of PCA loadings — and evaluate equal-weight cluster portfolios, long-short spreads, and PCA-versus-HPCA eigenvalue spectra on a sector-balanced S&P 500 panel. Results are descriptive and research-oriented; they illustrate how dynamic clustering can adapt HPCA to evolving correlation regimes without claiming live trading alpha.


Introduction: The Identification Problem in Factor Models

Principal Components Analysis (PCA) is widely used to extract common risk factors from the correlation matrix of standardized asset returns. The first eigenportfolio explains the maximum variance; recursively, additional orthogonal directions capture remaining structure. In matrix form, for correlation matrix \(R\), the leading eigenvector \(V^{(1)}\) solves:

\[ V^{(1)} = \arg\max_{\|V\| = 1} V^{\top} R V \]

with eigenvalue \(\lambda_1\) satisfying \(R V^{(1)} = \lambda_1 V^{(1)}\). The Karhunen–Loève representation of standardized return \(X_i\) is:

\[ X_i = \sum_{k=1}^{N} \sqrt{\lambda_k}\, V_i^{(k)} F_k \]

where \(F_k\) are uncorrelated eigenportfolio returns with unit variance. A one-factor approximation gives \(X_i = \beta_i F_1 + \epsilon_i\) with \(\beta_i = \sqrt{\lambda_1}\, V_i^{(1)}\), but the residuals \(\epsilon_i\) and \(\epsilon_j\) are generally correlated across assets, especially when \(i\) and \(j\) belong to unrelated economic sectors.
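The one-factor decomposition can be checked numerically. The sketch below is illustrative only: the return panel is synthetic, `pca_of_returns` is a hypothetical helper name, and population (ddof=0) moments are assumed so that the correlation matrix has an exact unit diagonal.

```python
import numpy as np

def pca_of_returns(returns):
    """Standardize a (T, N) return panel and eigendecompose its correlation matrix."""
    returns = np.asarray(returns, dtype=float)
    # ddof=0 standard deviation matches the 1/T normalization used for R below.
    X = (returns - returns.mean(axis=0)) / returns.std(axis=0)
    R = X.T @ X / X.shape[0]
    lam, V = np.linalg.eigh(R)          # eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]      # reorder to descending
    if V[:, 0].sum() < 0:               # eigenvectors are only defined up to sign
        V[:, 0] = -V[:, 0]
    return X, R, lam, V

# Synthetic panel (assumption): one common driver plus idiosyncratic noise.
rng = np.random.default_rng(0)
T, N = 500, 6
market = rng.standard_normal(T)
returns = 0.7 * market[:, None] + rng.standard_normal((T, N))

X, R, lam, V = pca_of_returns(returns)
F1 = X @ V[:, 0] / np.sqrt(lam[0])      # unit-variance first eigenportfolio
beta = np.sqrt(lam[0]) * V[:, 0]        # one-factor loadings
resid = X - np.outer(F1, beta)          # one-factor residuals
resid_corr = np.corrcoef(resid, rowvar=False)
```

Inspecting the off-diagonal entries of `resid_corr` shows that the residuals of a one-factor model are typically not uncorrelated, which is the point made in the paragraph above.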

Avellaneda and Serur (2020) argue that correlations between economically unrelated stocks (e.g. technology vs. energy) are noisy and difficult to estimate reliably; this is the identification problem. HPCA mitigates this by embedding economic or statistical partition information into the correlation model, producing interpretable sector factors and a positive semi-definite block-structured matrix \(\tilde{R}\).

Classical PCA and Correlation Estimation

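Let \(r_{i,t}\) denote the simple return of asset \(i\) at time \(t\), and standardize to \(X_{i,t} = (r_{i,t} - \bar{r}_i)/\sigma_i\) using the in-sample mean \(\bar{r}_i\) and volatility \(\sigma_i\). The empirical correlation matrix is:

\[ R_{ij} = \frac{1}{T} \sum_{t=1}^{T} X_{i,t} X_{j,t} \]

for a sample of \(T\) observations (here, monthly returns aggregated from daily prices). PCA on \(R\) yields eigenvalues \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N \ge 0\) and eigenvectors \(V^{(1)}, \dots, V^{(N)}\). Random matrix theory (Laloux et al., 2000) suggests discarding eigenvalues below the Marcenko–Pastur bound when separating signal from noise.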
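As a rough illustration of the random-matrix cutoff, the sketch below uses the standard Marcenko–Pastur upper edge for unit-variance data, \((1 + \sqrt{N/T})^2\). The function names and the synthetic panels are assumptions made for this example; the study's own cutoff rule is not specified in the text.

```python
import numpy as np

def marchenko_pastur_upper(n_assets, n_obs):
    """Upper edge of the Marcenko-Pastur law for unit-variance i.i.d. entries."""
    q = n_assets / n_obs
    return (1.0 + np.sqrt(q)) ** 2

def count_signal_eigenvalues(R, n_obs):
    """Number of eigenvalues of correlation matrix R above the noise edge."""
    lam = np.linalg.eigvalsh(R)
    return int((lam > marchenko_pastur_upper(R.shape[0], n_obs)).sum())

# Synthetic check (assumption): pure noise versus noise plus one common factor.
rng = np.random.default_rng(1)
T, N = 400, 40
noise = rng.standard_normal((T, N))
factor = rng.standard_normal(T)
R_noise = np.corrcoef(noise, rowvar=False)
R_factor = np.corrcoef(noise + 0.8 * factor[:, None], rowvar=False)

n_signal = count_signal_eigenvalues(R_factor, T)
```

In finite samples the largest eigenvalue of a pure-noise matrix such as `R_noise` can land slightly above or below the edge, so the bound is best read as a guide rather than a hard test.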

A \(K\)-factor PCA model writes:

\[ X_i = \sum_{k=1}^{K} \beta_{ik} F_k + \epsilon_i, \qquad \beta_{ik} = \sqrt{\lambda_k}\, V_i^{(k)} \]

After defactoring, the residual correlation matrix should have a top eigenvalue consistent with pure noise if \(K\) is adequate. In practice, \(K\) is chosen by inspecting the scree plot or RMT cutoff.

The limitation for portfolio management is that the higher-order factors \(F_k\), \(k \ge 2\), rarely admit stable sector interpretations and fluctuate across estimation windows, which motivates a model that uses partition information explicitly.

Hierarchical PCA: Block Structure and the HPCA Assumption

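Consider a universe of \(N\) assets partitioned into \(M\) blocks \(B_1, \dots, B_M\), each containing \(N_m\) assets. Define the block assignment function \(s(i) = m\) if asset \(i\) belongs to block \(B_m\). Within each block, run PCA and retain the first eigenportfolio factor \(F^{(m)}\). For each asset \(i\):

\[ X_i = \beta_i F^{(s(i))} + \epsilon_i \]

where \(\beta_i\) is the regression loading on the block's leading factor.

The HPCA assumption (Avellaneda & Serur, Eq. 10) states that residuals are uncorrelated across blocks:

\[ \mathbb{E}[\epsilon_i \epsilon_j] = 0 \quad \text{whenever } s(i) \ne s(j) \]

Intra-block correlations \(\tilde{R}_{ij}\) for \(s(i) = s(j)\) remain equal to the empirical correlations \(R_{ij}\) within the sector. Cross-block entries of the model correlation matrix are:

\[ \tilde{R}_{ij} = \beta_i\, \beta_j\, \rho_{s(i) s(j)}, \qquad s(i) \ne s(j) \]

where \(\rho_{mn} = \mathrm{Corr}(F^{(m)}, F^{(n)})\) is the correlation between block-level factors. Proposition 1 (Avellaneda & Serur): \(\tilde{R}\) is symmetric, has unit diagonal, and is positive semi-definite; hence it is a valid correlation matrix for a multivariate model.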

Constructing \(\tilde{R}\): Sector Factors, Betas, and Cross-Block Links

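The empirical pipeline for each block proceeds as follows. Let \(X^{(m)}\) be the \(N_m \times T\) matrix of standardized returns for assets in block \(m\) (stocks as rows, time as columns), with block correlation matrix \(R^{(m)} = \frac{1}{T} X^{(m)} X^{(m)\top}\). Compute the first principal component:

\[ R^{(m)} V^{(m,1)} = \lambda_1^{(m)} V^{(m,1)} \]

The block factor return series is the eigenportfolio:

\[ F^{(m)}_t = \frac{1}{\sqrt{\lambda_1^{(m)}}} \sum_{i \in B_m} V_i^{(m,1)} X_{i,t} \]

For each asset \(i\) in block \(m\), the loading is \(\beta_i = \mathrm{Corr}(X_i, F^{(m)})\), estimated by Pearson correlation over the sample window.

Cross-block factor correlations \(\rho_{mn}\) are computed from the time series \(F^{(m)}_t\) and \(F^{(n)}_t\). The full matrix \(\tilde{R}\) is assembled block-by-block: diagonal blocks copy the empirical \(R^{(m)}\); off-diagonal blocks are rank-one outer products \(\rho_{mn}\, \beta^{(m)} \beta^{(n)\top}\), where \(\beta^{(m)}\) collects the loadings of the assets in block \(m\).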
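The assembly steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: `hpca_correlation` is a hypothetical helper name, the input is assumed to be standardized with population (ddof=0) moments, and the three-sector panel is synthetic.

```python
import numpy as np

def hpca_correlation(X, labels):
    """Build the block-structured HPCA correlation matrix.

    X      : (T, N) standardized returns (zero mean, unit ddof=0 variance per column)
    labels : length-N array of block labels
    Returns (R_tilde, beta, F, blocks).
    """
    X = np.asarray(X, dtype=float)
    blocks, block_index = np.unique(np.asarray(labels), return_inverse=True)
    block_index = block_index.ravel()
    T, N = X.shape
    R = X.T @ X / T                                  # empirical correlation
    beta = np.zeros(N)
    F = np.zeros((T, len(blocks)))
    for m in range(len(blocks)):
        idx = np.flatnonzero(block_index == m)
        lam, V = np.linalg.eigh(R[np.ix_(idx, idx)])
        lam1, v1 = lam[-1], V[:, -1]                 # leading eigenpair of the block
        if v1.sum() < 0:                             # fix the sign ambiguity
            v1 = -v1
        F[:, m] = X[:, idx] @ v1 / np.sqrt(lam1)     # unit-variance block factor
        beta[idx] = X[:, idx].T @ F[:, m] / T        # Pearson correlation with the factor
    rho = F.T @ F / T                                # factor correlation matrix
    # Cross-block entries: beta_i * beta_j * rho_{s(i) s(j)}
    R_tilde = np.outer(beta, beta) * rho[np.ix_(block_index, block_index)]
    # Diagonal blocks copy the empirical correlations.
    same = block_index[:, None] == block_index[None, :]
    R_tilde[same] = R[same]
    return R_tilde, beta, F, blocks

# Synthetic three-sector panel (assumption, for illustration only).
rng = np.random.default_rng(42)
T = 240
labels = np.repeat(["tech", "energy", "utilities"], 4)
market = rng.standard_normal(T)
sector = {s: rng.standard_normal(T) for s in ("tech", "energy", "utilities")}
raw = np.column_stack(
    [0.5 * market + 0.8 * sector[str(s)] + rng.standard_normal(T) for s in labels]
)
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
R_emp = X.T @ X / T
R_tilde, beta, F, blocks = hpca_correlation(X, labels)
min_eig = np.linalg.eigvalsh(R_tilde).min()
```

On this example `R_tilde` is symmetric with unit diagonal and `min_eig` is non-negative up to floating-point error, in line with Proposition 1.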

Geometrically, HPCA induces a tree structure: a root (full market), branches (sectors or clusters), and leaves (individual stocks). A two-layer model uses sectors only; extensions add sub-sectors, underlyings for derivatives, or obligor groupings in credit — mathematically identical, with deeper trees.

The correlation heatmaps and sector-factor loading table below compare empirical \(R\) with sector-based \(\tilde{R}\) on a twelve-name subset (highest ), illustrating how cross-block entries compress while intra-block structure is preserved.


Spectral Properties: Eigenvalues and Eigenvectors of \(\tilde{R}\)

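A key theoretical result (Proposition 2, Avellaneda & Serur) is that HPCA eigenstructure decomposes cleanly across blocks. Let \(\lambda_k^{(m)}\) and \(V^{(m,k)}\) be the \(k\)-th eigenvalue and eigenvector of block-\(m\) correlation matrix \(R^{(m)}\). Embed sector eigenvectors into \(\mathbb{R}^N\):

\[ W^{(m,k)}_i = \begin{cases} V^{(m,k)}_i, & i \in B_m \\ 0, & \text{otherwise} \end{cases} \]

The vectors \(W^{(m,k)}\) form an orthogonal basis of \(\mathbb{R}^N\). The subspace generated by the first eigenvector of each block is invariant under \(\tilde{R}\).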
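These spectral claims can be verified numerically on a small synthetic example. The sketch below is an assumption-laden illustration (synthetic data, three blocks of sizes 3, 4, and 5): it builds the block-structured matrix, embeds the block eigenvectors with zero padding, and measures how far they are from satisfying the stated properties.

```python
import numpy as np

rng = np.random.default_rng(7)
T, sizes = 300, [3, 4, 5]
labels = np.repeat(np.arange(len(sizes)), sizes)
N = labels.size

# Synthetic returns (assumption): common driver + block driver + noise.
raw = 0.6 * rng.standard_normal((T, 1)) + rng.standard_normal((T, N))
for m in range(len(sizes)):
    raw[:, labels == m] += 0.8 * rng.standard_normal((T, 1))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
R = X.T @ X / T

W_first, W_rest, lam_rest = [], [], []
beta = np.zeros(N)
F = np.zeros((T, len(sizes)))
for m in range(len(sizes)):
    idx = np.flatnonzero(labels == m)
    lam, V = np.linalg.eigh(R[np.ix_(idx, idx)])
    lam, V = lam[::-1], V[:, ::-1]                  # descending order
    for k in range(len(idx)):
        w = np.zeros(N)
        w[idx] = V[:, k]                            # zero-padded embedding
        if k == 0:
            W_first.append(w)
        else:
            W_rest.append(w)
            lam_rest.append(lam[k])
    F[:, m] = X[:, idx] @ V[:, 0] / np.sqrt(lam[0])
    beta[idx] = np.sqrt(lam[0]) * V[:, 0]           # equals Corr(X_i, F^(m))

rho = F.T @ F / T
R_tilde = np.outer(beta, beta) * rho[np.ix_(labels, labels)]
same = labels[:, None] == labels[None, :]
R_tilde[same] = R[same]

# (a) Embedded eigenvectors with k >= 2 are eigenvectors of R_tilde.
resid_rest = max(np.abs(R_tilde @ w - l * w).max() for w, l in zip(W_rest, lam_rest))

# (b) The span of the first block eigenvectors is invariant under R_tilde.
W1 = np.column_stack(W_first)                       # orthonormal columns
image = R_tilde @ W1
resid_first = np.abs(image - W1 @ (W1.T @ image)).max()

# (c) All embedded vectors together form an orthonormal basis.
W_all = np.column_stack(W_first + W_rest)
```

Both residuals are at floating-point noise level here, so the remaining market-wide modes of `R_tilde` live entirely in the span of the sector first factors.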

Consequently, the leading market-wide modes of \(\tilde{R}\) are linear combinations of sector first factors, economically interpretable as sector rotations rather than opaque higher-order eigenportfolios. The bar chart below compares eigenvalue spectra of \(R\) (classical PCA) and \(\tilde{R}\) (sector HPCA), reporting Frobenius distance \(\|R - \tilde{R}\|_F\) and mean absolute off-diagonal gap \(\frac{1}{N(N-1)} \sum_{i \ne j} |R_{ij} - \tilde{R}_{ij}|\).
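A minimal sketch of the two distance measures follows. The helper name `correlation_gap` and the exact definition of the off-diagonal gap (a mean of absolute differences over all \(i \ne j\)) are assumptions for illustration; the 3×3 matrices are toy inputs, not study results.

```python
import numpy as np

def correlation_gap(R, R_tilde):
    """Frobenius distance and mean absolute off-diagonal difference."""
    diff = np.asarray(R, dtype=float) - np.asarray(R_tilde, dtype=float)
    frob = float(np.linalg.norm(diff, "fro"))
    off_diagonal = ~np.eye(diff.shape[0], dtype=bool)
    mean_abs_off = float(np.abs(diff[off_diagonal]).mean())
    return frob, mean_abs_off

# Toy inputs (assumption): assets 0 and 1 share a block, asset 2 is alone.
R = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
R_tilde = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.1],
                    [0.1, 0.1, 1.0]])

frob, gap = correlation_gap(R, R_tilde)
spectrum_R = np.linalg.eigvalsh(R)[::-1]            # descending eigenvalues
spectrum_R_tilde = np.linalg.eigvalsh(R_tilde)[::-1]
```

Because both matrices have unit diagonal, the two spectra sum to the same total; the comparison is about how that total is distributed across modes.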


Dynamic Clustering: Statistical and K-Means Partitions

Fixed GICS sectors provide economically meaningful blocks but may not align with time-varying correlation structure. Avellaneda and Serur introduce statistical clustering to identify homogeneous groups of stocks sharing common risk-factor sign patterns. Two approaches are implemented here.

Statistical (sign-pattern) clustering. Fit PCA on \(R\) with \(K\) components. Omit the first (market) eigenvector and retain eigenvectors \(V^{(2)}, \dots, V^{(K)}\). For each asset \(i\), form the sign vector \(b_i = \big(\mathrm{sign}\, V_i^{(2)}, \dots, \mathrm{sign}\, V_i^{(K)}\big)\). Assets with identical sign patterns belong to the same cluster:

\[ c(i) = c(j) \iff b_i = b_j \]

Clusters are re-labelled to consecutive integers. HPCA is then applied with \(c(i)\) instead of the GICS sector.
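The sign-pattern rule can be sketched as follows. `sign_pattern_clusters` is a hypothetical helper, and the six-asset correlation matrix is a toy input with two obvious groups. Flipping the sign of an eigenvector changes the bit patterns but not which assets share a pattern, so the partition does not depend on the solver's sign convention (only the integer labels may be permuted).

```python
import numpy as np

def sign_pattern_clusters(R, n_components):
    """Cluster assets by the signs of eigenvectors 2..n_components of R."""
    if n_components < 2:
        raise ValueError("need at least two components to drop the market mode")
    _, V = np.linalg.eigh(R)                        # ascending eigenvalues
    V = V[:, ::-1][:, :n_components]                # top components, descending
    signs = (V[:, 1:] >= 0).astype(int)             # omit the first (market) eigenvector
    # Identical rows get the same label; labels are consecutive integers from 0.
    _, labels = np.unique(signs, axis=0, return_inverse=True)
    return np.asarray(labels).ravel()

# Toy correlation matrix (assumption): 0.8 within groups, 0.2 across groups.
group = np.array([0, 0, 0, 1, 1, 1])
R = np.where(group[:, None] == group[None, :], 0.8, 0.2)
np.fill_diagonal(R, 1.0)

labels = sign_pattern_clusters(R, n_components=2)
```

With \(K\) components the rule can produce at most \(2^{K-1}\) clusters, and some patterns may be empty or contain a single asset.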

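K-means on PCA loadings. Extract the first \(p\) principal components of \(R\), forming the loading matrix \(L \in \mathbb{R}^{N \times p}\), whose \(i\)-th row \(L_i\) holds the loadings of asset \(i\). Apply K-means:

\[ \min_{C_1, \dots, C_Q} \; \sum_{q=1}^{Q} \sum_{i \in C_q} \| L_i - \mu_q \|^2 \]

with \(Q\) clusters and centroids \(\mu_q\). This groups assets by proximity in factor-exposure space before block-wise HPCA estimation.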
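For illustration, the sketch below implements plain Lloyd iterations in NumPy with a deterministic farthest-first initialization. That initialization is a simplification chosen here to keep the example reproducible; it is not stated in the source, and in practice a library routine such as scikit-learn's `KMeans` would be the usual choice. The helper names and the toy correlation matrix are assumptions.

```python
import numpy as np

def pca_loadings(R, n_components):
    """Loadings sqrt(lambda_k) * V_i^(k) for the top components of R."""
    lam, V = np.linalg.eigh(R)
    lam, V = lam[::-1][:n_components], V[:, ::-1][:, :n_components]
    return V * np.sqrt(np.clip(lam, 0.0, None))

def farthest_first_init(points, n_clusters):
    """Deterministic seeding: start at point 0, then repeatedly take the farthest point."""
    chosen = [0]
    d2 = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, n_clusters):
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].copy()

def kmeans(points, n_clusters, n_iter=100):
    """Lloyd's algorithm; an empty cluster keeps its previous centroid."""
    centroids = farthest_first_init(points, n_clusters)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = centroids.copy()
        for c in range(n_clusters):
            members = points[labels == c]
            if len(members) > 0:
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Toy correlation matrix (assumption): 0.8 within groups, 0.2 across groups.
group = np.array([0, 0, 0, 1, 1, 1])
R = np.where(group[:, None] == group[None, :], 0.8, 0.2)
np.fill_diagonal(R, 1.0)

L = pca_loadings(R, n_components=2)
labels, centroids = kmeans(L, n_clusters=2)
```

Flipping the sign of an eigenvector reflects the loading cloud without changing pairwise distances, so the resulting partition is not affected by the solver's sign convention.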

Both methods produce synthetic sectors that adapt to the current correlation matrix, enabling rolling re-estimation as market structure evolves. Portfolio results for each method appear in the sections below.

Empirical Results: Statistical Sign-Pattern Clustering

After assigning stocks to sign-pattern clusters, HPCA is re-estimated on the dynamic blocks. Equal-weight cluster portfolios are formed and cumulative wealth is tracked. The long-short spread contrasts the highest- and lowest-Sharpe clusters, benchmarked against SPY.

The charts below show cluster cumulative returns, the long-short spread versus the benchmark, and per-cluster Sharpe ratios. Compare sector diversity within each cluster — homogeneous sign-pattern groups may span multiple GICS industries when factor exposures align.


Empirical Results: K-Means on PCA Loadings

K-means partitions assets in the space of the first \(p\) PCA loadings before HPCA block estimation. Portfolio construction follows the same equal-weight and long-short framework as the statistical clustering section.

Because K-means groups by Euclidean distance in factor-loading space rather than discrete sign identity, clusters may differ substantially from sign-pattern partitions — particularly when loadings are continuous rather than binary. The exhibits below allow direct comparison of cluster return paths and Sharpe rankings.


Portfolio Construction, Risk Metrics, and Benchmark Comparison

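For each clustering scheme \(c\), define the equal-weight cluster portfolio return at month \(t\):

\[ r_{g,t} = \frac{1}{|C_g|} \sum_{i \in C_g} r_{i,t} \]

where \(C_g = \{ i : c(i) = g \}\). Annualized performance statistics (assuming 12 months per year) are:

\[ \mu_g^{\text{ann}} = 12\, \bar{r}_g, \qquad \sigma_g^{\text{ann}} = \sqrt{12}\, \mathrm{std}(r_{g,t}), \qquad \mathrm{SR}_g = \frac{\mu_g^{\text{ann}}}{\sigma_g^{\text{ann}}} \]

A long-short spread portfolio goes long the highest-Sharpe cluster and short the lowest-Sharpe cluster:

\[ r^{\text{LS}}_t = r_{g^{+},t} - r_{g^{-},t}, \qquad g^{+} = \arg\max_g \mathrm{SR}_g, \quad g^{-} = \arg\min_g \mathrm{SR}_g \]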

These metrics are reported in the statistical and K-means result sections above. All figures are in-sample descriptive statistics benchmarked against SPY; they do not incorporate transaction costs or shorting constraints.
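The portfolio statistics can be sketched as below. Several choices are assumptions made for this example rather than details confirmed by the text: arithmetic annualization, a zero risk-free rate, a sample (ddof=1) standard deviation, and the helper names. The four-month return panel is a toy input.

```python
import numpy as np

def cluster_portfolio_returns(returns, labels):
    """Equal-weight return series per cluster from a (T, N) return panel."""
    returns = np.asarray(returns, dtype=float)
    labels = np.asarray(labels)
    return {g.item(): returns[:, labels == g].mean(axis=1) for g in np.unique(labels)}

def annualized_sharpe(r, periods_per_year=12):
    """Annualized mean over annualized volatility (risk-free rate assumed zero)."""
    mu = periods_per_year * np.mean(r)
    sigma = np.sqrt(periods_per_year) * np.std(r, ddof=1)
    return float(mu / sigma)

def long_short_spread(cluster_returns):
    """Long the highest-Sharpe cluster, short the lowest-Sharpe cluster."""
    sharpe = {g: annualized_sharpe(r) for g, r in cluster_returns.items()}
    best = max(sharpe, key=sharpe.get)
    worst = min(sharpe, key=sharpe.get)
    return cluster_returns[best] - cluster_returns[worst], best, worst

# Toy monthly returns (assumption): four months, four assets, two clusters.
returns = np.array([[0.02, 0.04, -0.01, 0.01],
                    [0.01, 0.03, 0.02, -0.02],
                    [0.03, 0.01, 0.00, 0.00],
                    [0.02, 0.02, -0.01, -0.01]])
labels = [0, 0, 1, 1]

by_cluster = cluster_portfolio_returns(returns, labels)
spread, best, worst = long_short_spread(by_cluster)
```

Selecting the long and short legs on in-sample Sharpe ratios, as done here, is exactly what makes the resulting spread a descriptive statistic rather than a tradable backtest.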

Data and Empirical Design

The investable universe is a sector-balanced sample of S&P 500 constituents: approximately five names per GICS sector, selected with a fixed random seed for reproducibility. Daily adjusted closes from Yahoo Finance (2015 onward) are aggregated to monthly simple returns \(r_{i,t}\). SPY serves as the broad-market benchmark.
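A seeded, sector-balanced draw of this kind might look like the sketch below. The helper name, the seed value, and the placeholder tickers are all hypothetical; the actual constituent list and seed used in the study are not given in the text.

```python
import random
from collections import defaultdict

def sector_balanced_sample(sector_of, per_sector=5, seed=42):
    """Draw up to `per_sector` tickers from each sector, reproducibly."""
    by_sector = defaultdict(list)
    for ticker, sector in sector_of.items():
        by_sector[sector].append(ticker)
    rng = random.Random(seed)                       # local RNG, fixed seed
    picked = []
    for sector in sorted(by_sector):                # sorted for a stable draw order
        names = sorted(by_sector[sector])
        picked.extend(rng.sample(names, min(per_sector, len(names))))
    return picked

# Placeholder universe (assumption): 8 made-up tickers in each of 3 sectors.
sector_of = {
    f"{sector}{i:02d}": sector
    for sector in ("ENERGY", "TECH", "UTIL")
    for i in range(8)
}
universe = sector_balanced_sample(sector_of, per_sector=5, seed=42)
```

Sorting sectors and tickers before sampling keeps the draw independent of dictionary insertion order, so the same seed gives the same universe across runs.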

Three HPCA variants are estimated on the same return panel:

  • Sector HPCA: blocks defined by GICS sector labels (baseline, analogous to NAICS sectors in CRSP-based studies).
  • Statistical HPCA: blocks from sign-pattern clustering on PCA eigenvectors (\(K\) components).
  • K-means HPCA: blocks from K-means on the first 4 PCA loadings (\(Q\) clusters).

For each variant, the pipeline computes \(\tilde{R}\), cluster portfolio returns, Sharpe ratios, long-short spreads, and eigenvalue spectra. The PCA-versus-HPCA comparison uses the sector-based \(\tilde{R}\) against the classical empirical \(R\).

Limitations and Caveats

HPCA assumes uncorrelated residuals across blocks — a deliberate sparsity restriction that improves interpretability but may understate true cross-sector linkages during systemic crises when all correlations rise toward unity.

Estimation uses a single in-sample window; rolling or expanding re-estimation would better capture regime shifts but increases computational cost and introduces look-ahead considerations if not carefully implemented.

Yahoo Finance data and GICS labels differ from the CRSP/NAICS panel in the original Avellaneda & Serur (2020) study. Results illustrate methodology rather than replicate published CRSP numbers exactly.

Cluster Sharpe ratios and long-short spreads are in-sample descriptive statistics. They do not account for transaction costs, shorting constraints, or multiple-testing correction across cluster partitions.

This document is research output for education and quantitative discussion. It is not investment advice.

Conclusion

Hierarchical PCA offers a principled resolution to the PCA identification problem in large equity universes. By partitioning assets into economically or statistically meaningful blocks and imposing zero cross-block residual correlation, the model produces a valid, interpretable correlation matrix with an explicit tree structure.

Three lessons emerge from this empirical study. First, the eigenvalue spectrum of \(\tilde{R}\) diverges measurably from that of classical \(R\), confirming that the block constraint is not a cosmetic relabelling: it reallocates variance across modes in a way that favours sector-factor interpretability. Second, sector factor loadings vary widely within blocks, a reminder that HPCA retains full intra-block empirical richness while only sparsifying cross-block entries. Third, dynamic clustering, whether by sign patterns or K-means on loadings, allows HPCA to adapt as correlation geometry shifts, extending the static sector framework of Avellaneda and Serur (2020) to time-varying market structure.

For portfolio managers, the framework supports risk decomposition (which sector factors drive co-movement?), correlation forecasting under structural priors, and cluster-based portfolio construction. The long-short spreads and Sharpe rankings reported here are descriptive summaries of in-sample cluster differentiation, not validated trading strategies. Future work should examine rolling re-estimation, out-of-sample correlation forecast accuracy, and integration with mean-variance or risk-parity optimisers under \(\tilde{R}\).

The summary statistics below synthesise the latest estimation run.


References

Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Modeling Asset Correlations. SSRN Working Paper 3903460. [https://ssrn.com/abstract=3903460](https://ssrn.com/abstract=3903460)

Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Applications to Portfolio Management. *Revista Mexicana de Economía y Finanzas*, 15(1), 1–18.

Avellaneda, M., & Lee, J.-H. (2010). Statistical arbitrage in the U.S. equities market. *Quantitative Finance*, 10(7), 761–782.

Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (2000). Random matrix theory and financial correlations. *International Journal of Theoretical and Applied Finance*, 3(3), 391–397.

Jolliffe, I. T. (2002). *Principal Component Analysis* (2nd ed.). Springer.
