Hierarchical PCA and Modeling Asset Correlations
Abstract
This study implements Hierarchical Principal Component Analysis (HPCA) for modeling cross-sectional correlations among US large-cap equities, following Avellaneda and Serur (2020). Classical PCA extracts eigenportfolios from the full empirical correlation matrix, but higher-order components are often unstable and economically opaque — the *identification problem*. HPCA imposes a tree-like block structure: assets are grouped into sectors or data-driven clusters, each with its own first eigenportfolio factor, while cross-block residual correlations are set to zero. We compare three block definitions — fixed GICS sectors, statistical sign-pattern clusters from PCA eigenvectors, and K-means partitions of PCA loadings — and evaluate equal-weight cluster portfolios, long-short spreads, and PCA-versus-HPCA eigenvalue spectra on a sector-balanced S&P 500 panel. Results are descriptive and research-oriented; they illustrate how dynamic clustering can adapt HPCA to evolving correlation regimes without claiming live trading alpha.
Introduction: The Identification Problem in Factor Models
Principal Components Analysis (PCA) is widely used to extract common risk factors from the correlation matrix of standardized asset returns. The first eigenportfolio explains the maximum variance; recursively, additional orthogonal directions capture remaining structure. In matrix form, for correlation matrix \(R\), the leading eigenvector solves:
\[ V^{(1)} = \arg\max_{\|V\| = 1} V^\top R V \]
with eigenvalue \(\lambda_1\) satisfying \(R V^{(1)} = \lambda_1 V^{(1)}\). The Karhunen–Loève representation of standardized return \(X_i\) is:
\[ X_i = \sum_{k=1}^{N} \sqrt{\lambda_k}\, V_i^{(k)} F_k \]
where \(F_k\) are uncorrelated eigenportfolio returns with unit variance. A one-factor approximation gives \(X_i = \beta_i F_1 + \epsilon_i\) with \(\beta_i = \sqrt{\lambda_1}\, V_i^{(1)}\), but the residuals \(\epsilon_i\) and \(\epsilon_j\) are generally correlated across assets, especially when \(i\) and \(j\) belong to unrelated economic sectors.
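The eigenportfolio representation above can be checked numerically. The following is a minimal NumPy sketch on simulated data, not the study's code: the panel size, the single shared driver, and all variable names are assumptions made for illustration. It standardizes returns, diagonalizes the correlation matrix, builds unit-variance eigenportfolio returns, and forms the one-factor residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 6  # hypothetical sample length and universe size

# Simulated returns: idiosyncratic noise plus one shared driver (an assumption).
raw = rng.standard_normal((T, N)) + 0.8 * rng.standard_normal((T, 1))

# Standardize each asset to zero mean and unit (population) variance.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
R = X.T @ X / T  # empirical correlation matrix

# eigh returns ascending eigenvalues; reverse to get descending order.
lam, V = np.linalg.eigh(R)
lam, V = lam[::-1], V[:, ::-1]

# Eigenportfolio returns F_k = (V^(k) . X) / sqrt(lambda_k).
F = X @ V / np.sqrt(lam)

# Karhunen-Loeve reconstruction: X_i = sum_k sqrt(lambda_k) V_i^(k) F_k.
X_rebuilt = (F * np.sqrt(lam)) @ V.T

# One-factor loadings beta_i = sqrt(lambda_1) V_i^(1) and the residuals they leave.
beta = np.sqrt(lam[0]) * V[:, 0]
resid = X - np.outer(F[:, 0], beta)
resid_corr = np.corrcoef(resid, rowvar=False)
```

In this sketch `F.T @ F / T` is the identity up to rounding, which is the "uncorrelated, unit variance" property, while the off-diagonal entries of `resid_corr` stay visibly away from zero after the first factor is removed.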
Avellaneda and Serur (2020) argue that correlations between economically unrelated stocks (e.g. technology vs. energy) are noisy and difficult to estimate reliably, which is the identification problem. HPCA mitigates this by embedding economic or statistical partition information into the correlation model, producing interpretable sector factors and a positive semi-definite block-structured matrix \(\tilde{R}\).
Classical PCA and Correlation Estimation
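Let \(r_{i,t}\) denote the simple return of asset \(i\) at time \(t\), and standardize to \(X_{i,t} = (r_{i,t} - \bar{r}_i)/\sigma_i\) using the in-sample mean \(\bar{r}_i\) and volatility \(\sigma_i\). The empirical correlation matrix is:
\[ R_{ij} = \frac{1}{T} \sum_{t=1}^{T} X_{i,t} X_{j,t} \]
for a sample of \(T\) observations (here, monthly returns aggregated from daily prices). PCA on \(R\) yields eigenvalues \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N \ge 0\) and eigenvectors \(V^{(1)}, \dots, V^{(N)}\). Random matrix theory (Laloux et al., 2000) suggests discarding eigenvalues below the Marcenko–Pastur bound when separating signal from noise.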
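An \(m\)-factor PCA model writes:
\[ X_i = \sum_{k=1}^{m} \beta_{ik} F_k + \epsilon_i, \qquad \beta_{ik} = \sqrt{\lambda_k}\, V_i^{(k)} \]
After defactoring, the residual correlation matrix should have a top eigenvalue consistent with pure noise if \(m\) is adequate. In practice, \(m\) is chosen by inspecting the scree plot or RMT cutoff.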
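As an illustration of the defactoring check, the sketch below removes the first few PCA factors from a standardized panel and compares the top eigenvalue of the residual correlation matrix with the Marchenko–Pastur upper edge \((1 + \sqrt{N/T})^2\) for a pure-noise correlation matrix. The function names, the simulated two-factor panel, and the choice to re-standardize the residuals are assumptions for illustration, not the study's pipeline.

```python
import numpy as np

def standardize(raw):
    """Column-wise z-score with population standard deviation."""
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

def residual_top_eigenvalue(X, m):
    """Largest eigenvalue of the residual correlation matrix after
    projecting the first m PCA directions out of X (shape T x N)."""
    T, _ = X.shape
    R = X.T @ X / T
    _, V = np.linalg.eigh(R)
    Vm = V[:, ::-1][:, :m]            # top-m eigenvectors
    resid = X - X @ Vm @ Vm.T         # defactored returns
    resid_corr = np.corrcoef(resid, rowvar=False)
    return np.linalg.eigvalsh(resid_corr)[-1]

def marchenko_pastur_upper(T, N):
    """Upper edge of the Marchenko-Pastur spectrum for unit-variance noise."""
    return (1.0 + np.sqrt(N / T)) ** 2

# Hypothetical panel: two common factors plus idiosyncratic noise.
rng = np.random.default_rng(1)
T, N = 600, 40
factors = rng.standard_normal((T, 2))
loadings = rng.uniform(0.5, 1.0, size=(N, 2))
loadings[:, 1] *= rng.choice([-1.0, 1.0], size=N)
X = standardize(factors @ loadings.T + rng.standard_normal((T, N)))

bound = marchenko_pastur_upper(T, N)
top_before = residual_top_eigenvalue(X, 0)   # no factors removed
top_after = residual_top_eigenvalue(X, 2)    # both simulated factors removed
```

Comparing `top_after` with `bound` is one way to judge whether the chosen number of factors is adequate; the cutoff is a heuristic, since re-standardized residuals are not exactly i.i.d. noise.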
The limitation for portfolio management is that the higher-order factors \(F_2, F_3, \dots\) rarely admit stable sector interpretations and fluctuate across estimation windows, which motivates a model that uses partition information explicitly.
Hierarchical PCA: Block Structure and the HPCA Assumption
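Consider a universe of \(N\) assets partitioned into \(B\) blocks \(S_1, \dots, S_B\), with block \(S_b\) containing \(N_b\) assets. Define the block assignment function \(b(i) = b\) if asset \(i\) belongs to block \(S_b\). Within each block, run PCA and retain the first eigenportfolio factor \(F^{(b)}\). For each asset \(i\):
\[ X_i = \beta_i F^{(b(i))} + \epsilon_i \]
where \(\beta_i\) is the regression loading on the block's leading factor.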
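The HPCA assumption (Avellaneda & Serur, Eq. 10) states that residuals are uncorrelated across blocks:
\[ \mathbb{E}[\epsilon_i \epsilon_j] = 0 \quad \text{whenever } b(i) \neq b(j) \]
Intra-block correlations \(\tilde{R}_{ij}\) for \(b(i) = b(j)\) remain equal to the empirical correlations \(R_{ij}\) within the sector. Cross-block entries of the model correlation matrix are:
\[ \tilde{R}_{ij} = \beta_i \beta_j\, \rho_{b(i) b(j)}, \qquad b(i) \neq b(j) \]
where \(\rho_{ab}\) is the correlation between the block-level factors \(F^{(a)}\) and \(F^{(b)}\). Proposition 1 (Avellaneda & Serur): \(\tilde{R}\) is symmetric, has unit diagonal, and is positive semi-definite; hence it is a valid correlation matrix for a multivariate model.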
Constructing \(\tilde{R}\): Sector Factors, Betas, and Cross-Block Links
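The empirical pipeline for each block proceeds as follows. Let \(X^{(b)} \in \mathbb{R}^{N_b \times T}\) be the matrix of standardized returns for assets in block \(b\) (stocks as rows, time as columns). Compute the first principal component:
\[ R^{(b)} = \frac{1}{T} X^{(b)} X^{(b)\top}, \qquad R^{(b)} V^{(b,1)} = \lambda_1^{(b)} V^{(b,1)} \]
The block factor return series is the eigenportfolio:
\[ F_t^{(b)} = \frac{1}{\sqrt{\lambda_1^{(b)}}} \sum_{i \in S_b} V_i^{(b,1)} X_{i,t} \]
For each asset \(i\) in block \(b\), the loading is \(\beta_i = \mathrm{Corr}(X_i, F^{(b)})\), estimated by Pearson correlation over the sample window.
Cross-block factor correlations \(\rho_{ab}\) are computed from the time series \(F_t^{(a)}\) and \(F_t^{(b)}\). The full matrix \(\tilde{R}\) is assembled block-by-block: diagonal blocks copy the empirical \(R^{(b)}\); off-diagonal blocks are rank-one outer products \(\rho_{ab}\, \beta^{(a)} \beta^{(b)\top}\), where \(\beta^{(a)}\) stacks the loadings of block \(a\).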
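The assembly steps in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the study's implementation: the function name `hpca_correlation`, the time-by-asset layout of `X`, the population (1/T) normalization, and the sign convention for the eigenvector are all choices made here.

```python
import numpy as np

def hpca_correlation(X, labels):
    """Build a block-structured HPCA correlation matrix.

    X      : (T, N) standardized returns (zero mean, unit population variance).
    labels : length-N block identifiers (sector or cluster ids).
    Returns (R_tilde, beta, rho): model correlation matrix, per-asset loadings
    on the own-block factor, and the block-factor correlation matrix.
    """
    T, N = X.shape
    blocks, pos = np.unique(np.asarray(labels), return_inverse=True)
    R = X.T @ X / T                       # empirical correlation matrix
    beta = np.zeros(N)
    F = np.zeros((T, len(blocks)))
    for b in range(len(blocks)):
        idx = np.flatnonzero(pos == b)
        lam, V = np.linalg.eigh(R[np.ix_(idx, idx)])
        lam1, v1 = lam[-1], V[:, -1]      # leading eigenpair of the block
        if v1.sum() < 0:                  # sign convention: long-biased factor
            v1 = -v1
        F[:, b] = X[:, idx] @ v1 / np.sqrt(lam1)   # unit-variance block factor
        beta[idx] = np.sqrt(lam1) * v1             # equals Corr(X_i, F_b)
    rho = F.T @ F / T                     # correlation between block factors
    # Cross-block entries: beta_i * beta_j * rho_{b(i) b(j)}.
    R_tilde = np.outer(beta, beta) * rho[np.ix_(pos, pos)]
    # Intra-block entries: copy the empirical correlations.
    same = pos[:, None] == pos[None, :]
    R_tilde[same] = R[same]
    return R_tilde, beta, rho
```

Because each block factor has unit variance by construction, `sqrt(lam1) * v1` coincides with the Pearson correlation between each asset and its block factor, so no separate regression step is needed in this sketch.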
Geometrically, HPCA induces a tree structure: a root (full market), branches (sectors or clusters), and leaves (individual stocks). A two-layer model uses sectors only; extensions add sub-sectors, underlyings for derivatives, or obligor groupings in credit — mathematically identical, with deeper trees.
The correlation heatmaps and sector-factor loading table below compare empirical \(R\) with sector-based \(\tilde{R}\) on a twelve-name subset (highest ), illustrating how cross-block entries compress while intra-block structure is preserved.
Spectral Properties: Eigenvalues and Eigenvectors of \(\tilde{R}\)
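A key theoretical result (Proposition 2, Avellaneda & Serur) is that HPCA eigenstructure decomposes cleanly across blocks. Let \(\lambda_k^{(b)}\) and \(V^{(b,k)}\) be the \(k\)-th eigenvalue and eigenvector of block-\(b\) correlation matrix \(R^{(b)}\). Embed sector eigenvectors into \(\mathbb{R}^N\):
\[ \hat{V}_i^{(b,k)} = \begin{cases} V_i^{(b,k)} & \text{if asset } i \text{ belongs to block } b, \\ 0 & \text{otherwise.} \end{cases} \]
The vectors \(\hat{V}^{(b,k)}\), taken over all blocks \(b\) and all \(k\), form an orthogonal basis of \(\mathbb{R}^N\). The subspace generated by the first eigenvector of each block, \(\mathrm{span}\{\hat{V}^{(1,1)}, \dots, \hat{V}^{(B,1)}\}\), is invariant under \(\tilde{R}\).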
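The invariance claim can be verified in a few lines. The following is a sketch under stated assumptions, not a reproduction of the paper's proof: it writes \(\beta^{(b)} = \sqrt{\lambda_1^{(b)}}\, V^{(b,1)}\) for the loading vector of block \(b\) (with unit-norm \(V^{(b,1)}\)), \(\rho_{ab}\) for the correlation between the factors of blocks \(a\) and \(b\) (so \(\rho_{bb} = 1\)), and \(\hat{V}^{(b,k)}\) for the \(k\)-th eigenvector of block \(b\) padded with zeros outside the block.

```latex
\[
\begin{aligned}
\big(\tilde{R}\,\hat{V}^{(b,1)}\big)\big|_{\text{block } b}
  &= R^{(b)} V^{(b,1)} = \lambda_1^{(b)} V^{(b,1)}, \\
\big(\tilde{R}\,\hat{V}^{(b,1)}\big)\big|_{\text{block } a}
  &= \rho_{ab}\, \beta^{(a)} \beta^{(b)\top} V^{(b,1)}
   = \rho_{ab} \sqrt{\lambda_1^{(a)} \lambda_1^{(b)}}\; V^{(a,1)},
   \qquad a \neq b, \\
\tilde{R}\,\hat{V}^{(b,1)}
  &= \sum_{a=1}^{B} \rho_{ab} \sqrt{\lambda_1^{(a)} \lambda_1^{(b)}}\; \hat{V}^{(a,1)}, \\
\tilde{R}\,\hat{V}^{(b,k)}
  &= \lambda_k^{(b)}\, \hat{V}^{(b,k)}, \qquad k \geq 2 .
\end{aligned}
\]
```

The third line shows that, restricted to the span of the embedded first eigenvectors, \(\tilde{R}\) acts as the \(B \times B\) matrix with entries \(\rho_{ab} \sqrt{\lambda_1^{(a)} \lambda_1^{(b)}}\). The last line uses \(\beta^{(b)\top} V^{(b,k)} = 0\) for \(k \geq 2\), so every higher-order embedded block eigenvector is itself an eigenvector of \(\tilde{R}\) with its block eigenvalue.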
Consequently, the leading market-wide modes of \(\tilde{R}\) are linear combinations of sector first factors, economically interpretable as sector rotations rather than opaque higher-order eigenportfolios. The bar chart below compares eigenvalue spectra of \(R\) (classical PCA) and \(\tilde{R}\) (sector HPCA), reporting Frobenius distance \(\|R - \tilde{R}\|_F\) and mean absolute off-diagonal gap \(\frac{1}{N(N-1)} \sum_{i \neq j} |R_{ij} - \tilde{R}_{ij}|\).
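The two distance measures named here are simple to compute once both matrices are available. The helper names below (`spectrum`, `correlation_gap`) are hypothetical, and the off-diagonal gap is taken to be the average of absolute entry-wise differences over all pairs of distinct assets, which is an assumption about the intended definition.

```python
import numpy as np

def spectrum(M):
    """Eigenvalues of a symmetric matrix in descending order."""
    return np.linalg.eigvalsh(M)[::-1]

def correlation_gap(R, R_tilde):
    """Frobenius distance and mean absolute off-diagonal difference
    between two correlation matrices of the same shape."""
    diff = R - R_tilde
    frobenius = np.linalg.norm(diff, ord="fro")
    off_diagonal = ~np.eye(R.shape[0], dtype=bool)
    mean_abs_gap = np.abs(diff[off_diagonal]).mean()
    return frobenius, mean_abs_gap

# Toy two-asset example (values chosen for illustration only).
R = np.array([[1.0, 0.5], [0.5, 1.0]])
R_tilde = np.array([[1.0, 0.2], [0.2, 1.0]])
frobenius, mean_abs_gap = correlation_gap(R, R_tilde)
```

Since HPCA leaves intra-block entries untouched, both measures are driven entirely by cross-block entries when applied to an empirical matrix and its HPCA counterpart.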
Dynamic Clustering: Statistical and K-Means Partitions
Fixed GICS sectors provide economically meaningful blocks but may not align with time-varying correlation structure. Avellaneda and Serur introduce statistical clustering to identify homogeneous groups of stocks sharing common risk-factor sign patterns. Two approaches are implemented here.
Statistical (sign-pattern) clustering. Fit PCA on \(R\) with \(m\) components. Omit the first (market) eigenvector and retain eigenvectors \(V^{(2)}, \dots, V^{(m)}\). For each asset \(i\), form the sign vector \(s_i = (\operatorname{sign} V_i^{(2)}, \dots, \operatorname{sign} V_i^{(m)})\). Assets with identical sign patterns belong to the same cluster:
\[ c(i) = c(j) \iff s_i = s_j \]
Clusters are re-labelled to consecutive integers. HPCA is then applied with \(c(i)\) instead of the GICS sector.
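A minimal sketch of the sign-pattern step follows. The function name `sign_pattern_clusters` is hypothetical, `m` counts the retained components including the market eigenvector, and treating an exactly zero entry as positive is an assumption made here to keep the pattern binary.

```python
import numpy as np

def sign_pattern_clusters(R, m):
    """Cluster assets by the sign pattern of eigenvectors 2..m of R.

    R : (N, N) correlation matrix.
    m : number of leading eigenvectors retained (m >= 2); the first
        (market) eigenvector is dropped before forming sign patterns.
    Returns integer labels 0..K-1, one per asset.
    """
    _, V = np.linalg.eigh(R)
    V = V[:, ::-1][:, :m]                        # top-m eigenvectors, descending
    signs = (V[:, 1:] >= 0).astype(np.int8)      # drop the market eigenvector
    # Identical rows share a label; labels come out as consecutive integers.
    _, labels = np.unique(signs, axis=0, return_inverse=True)
    return labels.reshape(-1)

# Toy matrix: two blocks of three assets, 0.8 within and 0.2 across.
block = np.array([0, 0, 0, 1, 1, 1])
R = np.where(block[:, None] == block[None, :], 0.8, 0.2)
np.fill_diagonal(R, 1.0)
labels = sign_pattern_clusters(R, m=2)
```

Flipping the sign of an eigenvector flips that entry of every asset's pattern at once, so the resulting partition does not depend on the solver's arbitrary sign choice, although the integer labels may be permuted. With \(m\) components there are at most \(2^{m-1}\) clusters.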
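K-means on PCA loadings. Extract the first \(p\) principal components of \(R\), forming the loading matrix \(L \in \mathbb{R}^{N \times p}\) with rows \(L_i\). Apply K-means:
\[ \min_{c,\, \mu_1, \dots, \mu_K} \sum_{i=1}^{N} \left\| L_i - \mu_{c(i)} \right\|^2 \]
with \(K\) clusters and centroids \(\mu_1, \dots, \mu_K\). This groups assets by proximity in factor-exposure space before block-wise HPCA estimation.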
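The K-means step can be sketched with a plain Lloyd iteration in NumPy. This stands in for whatever library routine the study used, which the text does not specify. The function names, the scaling of loadings by the square root of the eigenvalue, and the deterministic farthest-point initialization are assumptions chosen to keep the example reproducible.

```python
import numpy as np

def pca_loadings(R, p):
    """First p PCA loadings of a correlation matrix, shape (N, p).
    Column k is sqrt(lambda_k) times the k-th eigenvector (an assumed scaling)."""
    lam, V = np.linalg.eigh(R)
    return V[:, ::-1][:, :p] * np.sqrt(lam[::-1][:p])

def kmeans(L, K, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization.
    Returns (labels, centroids)."""
    L = np.asarray(L, dtype=float)
    # Initialization: start at the first point, then repeatedly add the
    # point farthest from the centroids chosen so far.
    chosen = [L[0]]
    for _ in range(1, K):
        d2 = np.min([((L - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(L[d2.argmax()])
    centroids = np.array(chosen)
    labels = None
    for _ in range(n_iter):
        d2 = ((L[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments are stable
        labels = new_labels
        for k in range(K):
            members = L[labels == k]
            if len(members):                       # keep old centroid if empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy matrix: two blocks of three assets, 0.8 within and 0.2 across.
block = np.array([0, 0, 0, 1, 1, 1])
R = np.where(block[:, None] == block[None, :], 0.8, 0.2)
np.fill_diagonal(R, 1.0)
labels, centroids = kmeans(pca_loadings(R, p=2), K=2)
```

Euclidean distances are unchanged when a loading column changes sign, so the partition is insensitive to eigenvector sign conventions. The resulting labels can be passed to the block-wise HPCA estimation in place of sector identifiers.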
Both methods produce synthetic sectors that adapt to the current correlation matrix, enabling rolling re-estimation as market structure evolves. Portfolio results for each method appear in the sections below.
Empirical Results: Statistical Sign-Pattern Clustering
After assigning stocks to sign-pattern clusters, HPCA is re-estimated on the dynamic blocks. Equal-weight cluster portfolios are formed and cumulative wealth is tracked. The long-short spread contrasts the highest- and lowest-Sharpe clusters, benchmarked against SPY.
The charts below show cluster cumulative returns, the long-short spread versus the benchmark, and per-cluster Sharpe ratios. Compare sector diversity within each cluster — homogeneous sign-pattern groups may span multiple GICS industries when factor exposures align.
Empirical Results: K-Means on PCA Loadings
K-means partitions assets in the space of the first four PCA loadings before HPCA block estimation. Portfolio construction follows the same equal-weight and long-short framework as the statistical clustering section.
Because K-means groups by Euclidean distance in factor-loading space rather than discrete sign identity, clusters may differ substantially from sign-pattern partitions — particularly when loadings are continuous rather than binary. The exhibits below allow direct comparison of cluster return paths and Sharpe rankings.
Portfolio Construction, Risk Metrics, and Benchmark Comparison
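For each clustering scheme \(c\), define the equal-weight cluster portfolio return at month \(t\):
\[ r^{\mathrm{EW}}_{k,t} = \frac{1}{N_k} \sum_{i:\, c(i) = k} r_{i,t} \]
where \(N_k\) is the number of assets in cluster \(k\). Annualized performance statistics (assuming 12 months per year) are:
\[ \mathrm{Ret}_k = 12\, \bar{r}_k, \qquad \mathrm{Vol}_k = \sqrt{12}\, s_k, \qquad \mathrm{SR}_k = \frac{\mathrm{Ret}_k}{\mathrm{Vol}_k} \]
with \(\bar{r}_k\) and \(s_k\) the monthly sample mean and standard deviation of \(r^{\mathrm{EW}}_{k,t}\). A long-short spread portfolio goes long the highest-Sharpe cluster and short the lowest-Sharpe cluster:
\[ r^{\mathrm{LS}}_t = r^{\mathrm{EW}}_{k^+,t} - r^{\mathrm{EW}}_{k^-,t}, \qquad k^+ = \arg\max_k \mathrm{SR}_k, \quad k^- = \arg\min_k \mathrm{SR}_k \]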
These metrics are reported in the statistical and K-means result sections above. All figures are in-sample descriptive statistics benchmarked against SPY; they do not incorporate transaction costs or shorting constraints.
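The portfolio calculations in this section reduce to a few array operations. The sketch below is illustrative only: the function names are hypothetical, the Sharpe ratio is computed as annualized mean over annualized volatility with no risk-free adjustment (an assumption, since the text does not state the risk-free treatment), and volatility uses the sample standard deviation.

```python
import numpy as np

def cluster_portfolios(returns, labels):
    """Equal-weight cluster returns.
    returns : (T, N) monthly simple returns; labels : length-N cluster ids.
    Output has one column per cluster, ordered by sorted cluster id."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return np.column_stack(
        [returns[:, labels == k].mean(axis=1) for k in clusters]
    )

def annualized_stats(r, periods=12):
    """Annualized mean, volatility and Sharpe ratio (zero risk-free rate assumed)."""
    mean = periods * r.mean(axis=0)
    vol = np.sqrt(periods) * r.std(axis=0, ddof=1)
    return mean, vol, mean / vol

def long_short(port):
    """Spread between the highest- and lowest-Sharpe cluster portfolios."""
    _, _, sharpe = annualized_stats(port)
    hi, lo = sharpe.argmax(), sharpe.argmin()
    return port[:, hi] - port[:, lo], hi, lo

# Toy panel: three months, three assets, two clusters (illustrative values).
returns = np.array([[0.02, 0.04, -0.01],
                    [0.00, 0.02,  0.01],
                    [0.04, 0.00, -0.03]])
port = cluster_portfolios(returns, labels=[0, 0, 1])
spread, hi, lo = long_short(port)
wealth = np.cumprod(1.0 + spread)   # cumulative wealth of the spread
```

Because the long and short legs are selected on the same sample used to compute the Sharpe ratios, the spread from this construction is in-sample by design.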
Data and Empirical Design
The investable universe is a sector-balanced sample of S&P 500 constituents: approximately five names per GICS sector, selected with a fixed random seed for reproducibility. Daily adjusted closes from Yahoo Finance (2015 onward) are aggregated to monthly simple returns \(r_{i,t}\). SPY serves as the broad-market benchmark.
Three HPCA variants are estimated on the same return panel:
- Sector HPCA: blocks defined by GICS sector labels (baseline, analogous to NAICS sectors in CRSP-based studies).
- Statistical HPCA: blocks from sign-pattern clustering on PCA eigenvectors (\(m\) components).
- K-means HPCA: blocks from K-means on the first 4 PCA loadings (\(K\) clusters).
For each variant, the pipeline computes \(\tilde{R}\), cluster portfolio returns, Sharpe ratios, long-short spreads, and eigenvalue spectra. The PCA-versus-HPCA comparison uses the sector-based \(\tilde{R}\) against the classical empirical \(R\).
Limitations and Caveats
HPCA assumes uncorrelated residuals across blocks — a deliberate sparsity restriction that improves interpretability but may understate true cross-sector linkages during systemic crises when all correlations rise toward unity.
Estimation uses a single in-sample window; rolling or expanding re-estimation would better capture regime shifts but increases computational cost and introduces look-ahead considerations if not carefully implemented.
Yahoo Finance data and GICS labels differ from the CRSP/NAICS panel in the original Avellaneda & Serur (2020) study. Results illustrate methodology rather than replicate published CRSP numbers exactly.
Cluster Sharpe ratios and long-short spreads are in-sample descriptive statistics. They do not account for transaction costs, shorting constraints, or multiple-testing correction across cluster partitions.
This document is research output for education and quantitative discussion. It is not investment advice.
Conclusion
Hierarchical PCA offers a principled resolution to the PCA identification problem in large equity universes. By partitioning assets into economically or statistically meaningful blocks and imposing zero cross-block residual correlation, the model produces a valid, interpretable correlation matrix with an explicit tree structure.
Three lessons emerge from this empirical study. First, the eigenvalue spectrum of \(\tilde{R}\) diverges measurably from classical \(R\), confirming that the block constraint is not a cosmetic relabelling: it reallocates variance across modes in a way that favours sector-factor interpretability. Second, sector factor loadings vary widely within blocks, reminding us that HPCA retains full intra-block empirical richness while only sparsifying cross-block entries. Third, dynamic clustering, whether by sign patterns or K-means on loadings, allows HPCA to adapt as correlation geometry shifts, extending the static sector framework of Avellaneda and Serur (2020) to time-varying market structure.
For portfolio managers, the framework supports risk decomposition (which sector factors drive co-movement?), correlation forecasting under structural priors, and cluster-based portfolio construction. The long-short spreads and Sharpe rankings reported here are descriptive summaries of in-sample cluster differentiation, not validated trading strategies. Future work should examine rolling re-estimation, out-of-sample correlation forecast accuracy, and integration with mean-variance or risk-parity optimisers under \(\tilde{R}\).
The summary statistics below synthesise the latest estimation run.
References
Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Modeling Asset Correlations. SSRN Working Paper 3903460. [https://ssrn.com/abstract=3903460](https://ssrn.com/abstract=3903460)
Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Applications to Portfolio Management. *Revista Mexicana de Economía y Finanzas*, 15(1), 1–18.
Avellaneda, M., & Lee, J.-H. (2010). Statistical arbitrage in the U.S. equities market. *Quantitative Finance*, 10(7), 761–782.
Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (2000). Random matrix theory and financial correlations. *International Journal of Theoretical and Applied Finance*, 3(3), 391–397.
Jolliffe, I. T. (2002). *Principal Component Analysis* (2nd ed.). Springer.