Hierarchical PCA and Modeling Asset Correlations

Abstract

This study implements Hierarchical Principal Component Analysis (HPCA) for modeling cross-sectional correlations among US large-cap equities, following Avellaneda and Serur (2020). Classical PCA extracts eigenportfolios from the full empirical correlation matrix, but higher-order components are often unstable and economically opaque — the *identification problem*. HPCA imposes a tree-like block structure: assets are grouped into sectors or data-driven clusters, each with its own first eigenportfolio factor, while cross-block residual correlations are set to zero. We compare three block definitions — fixed GICS sectors, statistical sign-pattern clusters from PCA eigenvectors, and K-means partitions of PCA loadings — and evaluate equal-weight cluster portfolios, long-short spreads, and PCA-versus-HPCA eigenvalue spectra on a sector-balanced S&P 500 panel. Results are descriptive and research-oriented; they illustrate how dynamic clustering can adapt HPCA to evolving correlation regimes without claiming live trading alpha.

Introduction: The Identification Problem in Factor Models

Principal Component Analysis (PCA) is widely used to extract common risk factors from the correlation matrix of standardized asset returns. The first eigenportfolio explains the maximum variance; each subsequent orthogonal direction captures the most remaining variance. In matrix form, for a correlation matrix $R \in \mathbb{R}^{n \times n}$, the leading eigenvector solves:

$V^{(1)} = \arg\max_{\| V \| = 1} V^{\top} R V$

with eigenvalue $λ^{(1)}$ satisfying $R V^{(1)} = λ^{(1)} V^{(1)}$ . The Karhunen–Loève representation of standardized return $X_{j}$ is:

$X_{j} = \sum_{k=1}^{n} \sqrt{λ^{(k)}} \, V_{j}^{(k)} F^{(k)}, \qquad F^{(k)} = \frac{1}{\sqrt{λ^{(k)}}} \sum_{i=1}^{n} V_{i}^{(k)} X_{i}$

where the $F^{(k)}$ are uncorrelated eigenportfolio returns with unit variance. A one-factor approximation gives $X_{j} = β_{j} F^{(1)} + ϵ_{j}$ with $β_{j} = \sqrt{λ^{(1)}} \, V_{j}^{(1)}$, but the residuals $ϵ_{i}, ϵ_{j}$ are generally correlated across assets, and those correlations are hardest to estimate reliably when $i$ and $j$ belong to unrelated economic sectors.
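The decomposition above can be checked numerically. The following is a minimal sketch on synthetic data (the simulated panel, the function name `eigenportfolios`, and the seed are illustrative assumptions, not part of the study); it builds the eigenportfolio series and the one-factor loadings from standardized returns.

```python
import numpy as np

def eigenportfolios(X):
    """X: (T, n) standardized returns (zero mean, unit sample variance per column).
    Returns eigenvalues in descending order, eigenvectors as columns,
    and the unit-variance eigenportfolio return series F of shape (T, n)."""
    T = X.shape[0]
    R = X.T @ X / (T - 1)                # empirical correlation matrix
    lam, V = np.linalg.eigh(R)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    F = X @ V / np.sqrt(lam)             # F^(k) = (1 / sqrt(lam_k)) * sum_i V_i^(k) X_i
    return lam, V, F

# Synthetic panel: idiosyncratic noise plus one common shock (illustrative only).
rng = np.random.default_rng(0)
raw = rng.standard_normal((240, 6)) + rng.standard_normal((240, 1))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

lam, V, F = eigenportfolios(X)
beta = np.sqrt(lam[0]) * V[:, 0]         # one-factor loadings beta_j
```

In this sketch the sample covariance of `F` is the identity and `beta` coincides with the in-sample correlation between each asset and the first eigenportfolio, which is what the one-factor approximation relies on.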

Avellaneda and Serur (2020) argue that correlations between economically unrelated stocks (e.g. technology vs. energy) are noisy and difficult to estimate reliably — the identification problem. HPCA mitigates this by embedding economic or statistical partition information into the correlation model, producing interpretable sector factors and a positive semi-definite block-structured matrix $\tilde{R}$ .

Classical PCA and Correlation Estimation

Let $r_{i, t}$ denote the simple return of asset $i$ at time $t$ , and standardize to $X_{i, t} = (r_{i, t} - μ_{i}) / σ_{i}$ using the in-sample mean $μ_{i}$ and volatility $σ_{i}$ . The empirical correlation matrix is:

$R_{ij} = \mathrm{Corr}(X_{i}, X_{j}) = \frac{1}{T - 1} \sum_{t=1}^{T} X_{i, t} X_{j, t}$

for a sample of $T$ observations (here, monthly returns aggregated from daily prices). PCA on $R$ yields eigenvalues $λ^{(1)} \geq λ^{(2)} \geq \dots \geq λ^{(n)}$ and eigenvectors $V^{(k)}$. Random matrix theory (Laloux et al., 2000) suggests discarding eigenvalues below the Marchenko–Pastur upper bound $λ_{+}^{MP} = (1 + \sqrt{n / T})^{2}$ when separating signal from noise.
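As a rough illustration of the noise cutoff, the sketch below computes the Marchenko–Pastur upper edge $(1 + \sqrt{n / T})^{2}$ and counts eigenvalues above it. The helper names and the toy equicorrelation matrix are assumptions made for this example only.

```python
import numpy as np

def marchenko_pastur_upper(n, T):
    """Upper edge of the Marchenko-Pastur spectrum for the correlation matrix
    of n independent unit-variance series observed over T periods."""
    return (1.0 + np.sqrt(n / T)) ** 2

def count_signal_eigenvalues(R, T):
    """Number of eigenvalues of R that exceed the Marchenko-Pastur upper edge."""
    lam = np.linalg.eigvalsh(R)
    return int(np.sum(lam > marchenko_pastur_upper(R.shape[0], T)))

# Toy matrix: 10 assets with pairwise correlation 0.5 (one strong common mode).
R_toy = 0.5 * np.ones((10, 10)) + 0.5 * np.eye(10)
n_signal = count_signal_eigenvalues(R_toy, T=100)
```

For the toy matrix only the common mode (eigenvalue 5.5) clears the bound, so a single factor would be retained under this rule.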

An $m$-factor PCA model writes:

$X_{j} = \sum_{k=1}^{m} β_{j}^{(k)} F^{(k)} + ϵ_{j}$

After defactoring, the residual correlation matrix $R_{ij}^{(m)} = Corr (ϵ_{i}, ϵ_{j})$ should have a top eigenvalue consistent with pure noise if $m$ is adequate. In practice, $m$ is chosen by inspecting the scree plot or RMT cutoff.
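One way to run this adequacy check is sketched below. It assumes the factors are the first $m$ eigenportfolios of the same sample, so that the fitted part of each return is its projection onto the top-$m$ eigenvectors; the synthetic one-factor panel and the function name are illustrative.

```python
import numpy as np

def residual_correlation(X, m):
    """Correlation matrix of residuals after removing the first m eigenportfolios
    from standardized returns X of shape (T, n)."""
    T = X.shape[0]
    R = X.T @ X / (T - 1)
    lam, V = np.linalg.eigh(R)
    V_m = V[:, np.argsort(lam)[::-1][:m]]      # top-m eigenvectors
    resid = X - X @ V_m @ V_m.T                # subtract fitted sum_k beta_j^(k) F^(k)
    return np.corrcoef(resid, rowvar=False)

# Synthetic one-factor panel (illustrative only).
rng = np.random.default_rng(1)
raw = rng.standard_normal((500, 8)) + rng.standard_normal((500, 1))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

top_before = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[-1]
top_after = np.linalg.eigvalsh(residual_correlation(X, m=1))[-1]
```

Comparing `top_after` with the random-matrix cutoff for the same $n$ and $T$ gives a simple, if informal, test of whether $m$ factors are enough.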

The limitation for portfolio management is that $V^{(2)}, V^{(3)}, \dots$ rarely admit stable sector interpretations and fluctuate across estimation windows — motivating a model that uses partition information explicitly.

Hierarchical PCA: Block Structure and the HPCA Assumption

Consider a universe of $n$ assets partitioned into $b$ blocks $k = 1, \dots, b$ , each containing $n_{k}$ assets. Define the block assignment function $I (j) = k$ if asset $j$ belongs to block $k$ . Within each block, run PCA and retain the first eigenportfolio factor $F^{(1, k)}$ . For each asset $j$ :

$X_{j} = β_{j} F^{(1, I (j))} + ϵ_{j}$

where $β_{j} = Corr (X_{j}, F^{(1, I (j))})$ is the regression loading on the block's leading factor.

The HPCA assumption (Avellaneda & Serur, Eq. 10) states that residuals are uncorrelated across blocks:

$\text{If } I(i) \neq I(j), \text{ then } \mathrm{Corr}(ϵ_{i}, ϵ_{j}) = 0$

Intra-block correlations $R_{ij}$ for $I(i) = I(j)$ remain equal to the empirical correlations within the block, while cross-block entries are driven only by the block factors. The model correlation matrix is:

$\tilde{R}_{ij} = \begin{cases} R_{ij} & \text{if } I(i) = I(j) \\ β_{i} β_{j} \, \bar{ρ}^{I(i) I(j)} & \text{if } I(i) \neq I(j) \end{cases}$

where $\bar{ρ}^{k k'} = \mathrm{Corr}(F^{(1, k)}, F^{(1, k')})$ is the correlation between block-level factors. Proposition 1 (Avellaneda & Serur): $\tilde{R}$ is symmetric, has unit diagonal, and is positive semi-definite; hence it is a valid correlation matrix for a multivariate model.
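The cross-block entry follows from the one-factor representation. As a sketch, write $k = I(i)$ and $k' = I(j)$ with $k \neq k'$, and assume (a condition made explicit here, beyond the stated HPCA assumption) that each residual is also uncorrelated with the other block's factor. Since $X_{i}$, $X_{j}$ and the block factors all have unit variance:

$\mathrm{Corr}(X_{i}, X_{j}) = β_{i} β_{j} \, \mathrm{Corr}(F^{(1, k)}, F^{(1, k')}) + β_{i} \, \mathrm{Cov}(F^{(1, k)}, ϵ_{j}) + β_{j} \, \mathrm{Cov}(ϵ_{i}, F^{(1, k')}) + \mathrm{Cov}(ϵ_{i}, ϵ_{j}) = β_{i} β_{j} \, \bar{ρ}^{k k'}$

The last three terms vanish under those assumptions, leaving only the product of the two loadings and the factor correlation.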

Constructing $\tilde{R}$: Sector Factors, Betas, and Cross-Block Links

The empirical pipeline for each block $k$ proceeds as follows. Let $X^{(k)} \in \mathbb{R}^{T \times n_{k}}$ be the matrix of standardized returns for assets in block $k$ (time as rows, stocks as columns), with empirical correlation matrix $R^{(k)}$. Compute the first principal component:

$V^{(1, k)} = \arg\max_{\| V \| = 1} V^{\top} R^{(k)} V, \qquad λ^{(1, k)} = \text{leading eigenvalue of } R^{(k)}$

The block factor return series is the eigenportfolio:

$F_{t}^{(1, k)} = \frac{1}{\sqrt{λ^{(1, k)}}} \sum_{j : I(j) = k} V_{j}^{(1, k)} X_{j, t}$

For each asset $j$ in block $k$ , the loading is $β_{j} = Corr (X_{j}, F^{(1, k)})$ , estimated by Pearson correlation over the sample window.

Cross-block factor correlations $\bar{ρ}^{k k'}$ are computed from the time series $\{F_{t}^{(1, k)}\}$ and $\{F_{t}^{(1, k')}\}$. The full $n \times n$ matrix $\tilde{R}$ is assembled block-by-block: diagonal blocks copy the empirical $R^{(k)}$; off-diagonal blocks are rank-one outer products $β_{i} β_{j} \, \bar{ρ}^{I(i) I(j)}$.
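The assembly steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the study's code: the function name `hpca_correlation`, the synthetic three-block panel, and the shortcut $β_{j} = \sqrt{λ^{(1, k)}} V_{j}^{(1, k)}$ (which equals the in-sample Pearson correlation with the unit-variance block factor) are choices made for this example.

```python
import numpy as np

def hpca_correlation(X, labels):
    """X: (T, n) standardized returns; labels: length-n block ids.
    Returns the model matrix R_tilde, loadings beta, and block factors F (T, b)."""
    T, n = X.shape
    labels = np.asarray(labels)
    blocks = np.unique(labels)                       # sorted block ids
    R = X.T @ X / (T - 1)                            # empirical correlation
    beta = np.zeros(n)
    F = np.zeros((T, len(blocks)))
    for b, k in enumerate(blocks):
        idx = np.where(labels == k)[0]
        lam, V = np.linalg.eigh(R[np.ix_(idx, idx)])
        lam1, v1 = lam[-1], V[:, -1]                 # leading eigenpair (ascending order)
        F[:, b] = X[:, idx] @ v1 / np.sqrt(lam1)     # unit-variance block factor
        beta[idx] = np.sqrt(lam1) * v1               # equals Corr(X_j, F) in sample
    rho = np.atleast_2d(np.corrcoef(F, rowvar=False))  # factor correlations rho_bar
    pos = np.searchsorted(blocks, labels)            # block position of each asset
    R_tilde = np.outer(beta, beta) * rho[np.ix_(pos, pos)]
    same = labels[:, None] == labels[None, :]
    R_tilde[same] = R[same]                          # diagonal blocks keep empirical R
    return R_tilde, beta, F

# Synthetic panel: market shock + block shock + noise (illustrative only).
rng = np.random.default_rng(2)
labels = np.repeat([0, 1, 2], 4)
market = rng.standard_normal((300, 1))
sector = rng.standard_normal((300, 3))[:, labels]
raw = 0.5 * market + 0.7 * sector + rng.standard_normal((300, 12))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

R_tilde, beta, F = hpca_correlation(X, labels)
```

The sign of each eigenvector is arbitrary, but a flip changes the block factor and its loadings together, so the products $β_{i} β_{j} \bar{ρ}$ and hence $\tilde{R}$ are unaffected.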

Geometrically, HPCA induces a tree structure: a root (full market), branches (sectors or clusters), and leaves (individual stocks). A two-layer model uses sectors only; extensions add sub-sectors, underlyings for derivatives, or obligor groupings in credit — mathematically identical, with deeper trees.

The correlation heatmaps and sector-factor loading table below compare empirical $R$ with sector-based $\tilde{R}$ on a twelve-name subset (highest $|β|$), illustrating how cross-block entries compress while intra-block structure is preserved.

Spectral Properties: Eigenvalues and Eigenvectors of $\tilde{R}$

A key theoretical result (Proposition 2, Avellaneda & Serur) is that HPCA eigenstructure decomposes cleanly across blocks. Let $λ^{(i, k)}$ and $V^{(i, k)}$ be the $i$-th eigenvalue and eigenvector of the block-$k$ correlation matrix $R^{(k)}$. Embed sector eigenvectors into $\mathbb{R}^{n}$:

$W_{j}^{(i, k)} = \begin{cases} V_{j}^{(i, k)} & \text{if } I(j) = k \\ 0 & \text{if } I(j) \neq k \end{cases}$

The vectors $W^{(i, k)}$ form an orthogonal basis of $\mathbb{R}^{n}$. The subspace $Ω = \mathrm{span}\{W^{(1, 1)}, \dots, W^{(1, b)}\}$ generated by the first eigenvector of each block is invariant under $\tilde{R}$.
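A short calculation shows why $Ω$ is invariant. It uses the in-sample identity $β_{j} = \sqrt{λ^{(1, I(j))}} \, V_{j}^{(1, I(j))}$, which holds when each block factor is the unit-variance first eigenportfolio of its block and loadings are computed on the same sample; the display is a sketch under that assumption rather than a restatement of the paper's proof.

$\tilde{R} \, W^{(1, k)} = λ^{(1, k)} W^{(1, k)} + \sum_{k' \neq k} \sqrt{λ^{(1, k)} λ^{(1, k')}} \; \bar{ρ}^{k k'} \, W^{(1, k')}$

The first term comes from the diagonal block, where $R^{(k)} V^{(1, k)} = λ^{(1, k)} V^{(1, k)}$. Each term in the sum comes from an off-diagonal block, where row $i$ contributes $β_{i} \, \bar{ρ}^{k k'} \sum_{j : I(j) = k} β_{j} V_{j}^{(1, k)} = β_{i} \, \bar{ρ}^{k k'} \sqrt{λ^{(1, k)}}$. Restricted to $Ω$, $\tilde{R}$ therefore acts as the $b \times b$ matrix with entries $\sqrt{λ^{(1, k)} λ^{(1, k')}} \, \bar{ρ}^{k k'}$, whose eigenvectors mix only the block first factors.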

Consequently, the leading market-wide modes of $\tilde{R}$ are linear combinations of sector first factors, which makes them economically interpretable as sector rotations rather than opaque higher-order eigenportfolios. The bar chart below compares eigenvalue spectra of $R$ (classical PCA) and $\tilde{R}$ (sector HPCA), reporting the Frobenius distance $\| R - \tilde{R} \|_{F}$ and the mean absolute off-diagonal gap $|Δρ|$.
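The two distance measures can be computed as in the sketch below. The exact definition of the off-diagonal gap is not spelled out in the text, so this example assumes it is the mean of $|R_{ij} - \tilde{R}_{ij}|$ over all $i \neq j$; the function name and the 3-by-3 toy matrices are illustrative.

```python
import numpy as np

def spectrum_gap(R, R_tilde):
    """Compare an empirical correlation matrix with a model matrix of the same shape."""
    n = R.shape[0]
    off = ~np.eye(n, dtype=bool)                     # mask of off-diagonal entries
    return {
        "eig_R": np.linalg.eigvalsh(R)[::-1],        # descending spectra
        "eig_R_tilde": np.linalg.eigvalsh(R_tilde)[::-1],
        "frobenius": np.linalg.norm(R - R_tilde, ord="fro"),
        "mean_abs_offdiag": np.abs(R - R_tilde)[off].mean(),
    }

# Toy example: the third asset is decoupled in the model matrix.
R_emp = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
R_mod = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
gap = spectrum_gap(R_emp, R_mod)
```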

Dynamic Clustering: Statistical and K-Means Partitions

Fixed GICS sectors provide economically meaningful blocks but may not align with time-varying correlation structure. Avellaneda and Serur introduce statistical clustering to identify homogeneous groups of stocks sharing common risk-factor sign patterns. Two approaches are implemented here.

Statistical (sign-pattern) clustering. Fit PCA on $R$ with $K$ components. Omit the first (market) eigenvector and retain eigenvectors $V^{(2)}, \dots, V^{(K)}$. For each asset $j$, form the sign vector $(\mathrm{sign}(V_{j}^{(2)}), \dots, \mathrm{sign}(V_{j}^{(K)}))$. Assets with identical sign patterns belong to the same cluster, encoded as:

$c_{j} = \sum_{ℓ=2}^{K} 2^{ℓ - 2} \cdot \mathrm{sign}(V_{j}^{(ℓ)})$

Clusters are re-labelled to consecutive integers. HPCA is then applied with $I (j) = c_{j}$ instead of the GICS sector.
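A minimal sketch of the sign-pattern assignment follows. The function name is hypothetical, and treating an exactly zero loading as positive is an assumption of this example, since the text does not say how ties are handled.

```python
import numpy as np

def sign_pattern_clusters(R, K):
    """Assign each asset a cluster label from the signs of eigenvectors 2..K of R."""
    lam, V = np.linalg.eigh(R)
    V = V[:, np.argsort(lam)[::-1]]                  # columns in descending eigenvalue order
    signs = np.where(V[:, 1:K] >= 0, 1, -1)          # drop the market mode; zero counts as +1
    weights = 2 ** np.arange(K - 1)                  # 2^(l-2) for l = 2..K
    codes = signs @ weights                          # c_j before re-labelling
    _, labels = np.unique(codes, return_inverse=True)  # consecutive integers 0, 1, ...
    return labels

# Illustrative correlation matrix from random data.
rng = np.random.default_rng(3)
R_demo = np.corrcoef(rng.standard_normal((300, 10)), rowvar=False)
labels = sign_pattern_clusters(R_demo, K=3)
```

Flipping the sign of any eigenvector permutes the codes but leaves the partition itself unchanged, so the grouping does not depend on the solver's sign convention. With $K$ components there are at most $2^{K-1}$ clusters.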

K-means on PCA loadings. Extract the first $K'$ principal components of $R$, forming the loading matrix $Λ \in \mathbb{R}^{n \times K'}$. Apply K-means:

$c_{j} = \arg\min_{c \in \{1, \dots, C\}} \| Λ_{j} - μ_{c} \|^{2}$

with $C$ clusters and centroids $μ_{c}$ . This groups assets by proximity in factor-exposure space before block-wise HPCA estimation.
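The sketch below illustrates the idea with a plain Lloyd iteration written in NumPy, standing in for whatever K-means implementation the study actually used. It assumes loadings are scaled as $Λ_{jk} = \sqrt{λ^{(k)}} V_{j}^{(k)}$; the function names, seed, and initialisation from randomly chosen data points are all choices made for this example.

```python
import numpy as np

def pca_loadings(R, n_components):
    """Loading matrix with columns sqrt(lam_k) * V^(k) for the top components of R."""
    lam, V = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1][:n_components]
    return V[:, order] * np.sqrt(lam[order])

def nearest(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_labels(points, C, n_iter=100, seed=0):
    """Basic Lloyd iteration; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=C, replace=False)]
    for _ in range(n_iter):
        labels = nearest(points, centroids)
        updated = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(C)                        # keep an empty cluster's centroid
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest(points, centroids), centroids

# Illustrative use on a random correlation matrix.
rng = np.random.default_rng(4)
R_demo = np.corrcoef(rng.standard_normal((300, 12)), rowvar=False)
Lam = pca_loadings(R_demo, n_components=4)
labels, centroids = kmeans_labels(Lam, C=3)
```

Because K-means depends on initialisation, results from a single seed should be read as one possible partition; production code would typically use several restarts.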

Both methods produce synthetic sectors that adapt to the current correlation matrix, enabling rolling re-estimation as market structure evolves. Portfolio results for each method appear in the sections below.

Empirical Results: Statistical Sign-Pattern Clustering

After assigning stocks to sign-pattern clusters, HPCA is re-estimated on the dynamic blocks. Equal-weight cluster portfolios $r_{c, t}$ are formed and cumulative wealth $W_{c, T} = \prod_{t} (1 + r_{c, t})$ is tracked. The long-short spread $r_{L S, t} = r_{c^{*}, t} - r_{c_{*}, t}$ contrasts the highest- and lowest-Sharpe clusters, benchmarked against SPY.

The charts below show cluster cumulative returns, the long-short spread versus the benchmark, and per-cluster Sharpe ratios. Compare sector diversity within each cluster — homogeneous sign-pattern groups may span multiple GICS industries when factor exposures align.

Empirical Results: K-Means on PCA Loadings

K-means partitions assets in the space of the first $K^{'}$ PCA loadings before HPCA block estimation. Portfolio construction follows the same equal-weight and long-short framework as the statistical clustering section.

Because K-means groups by Euclidean distance in factor-loading space rather than discrete sign identity, clusters may differ substantially from sign-pattern partitions — particularly when loadings are continuous rather than binary. The exhibits below allow direct comparison of cluster return paths and Sharpe rankings.

Portfolio Construction, Risk Metrics, and Benchmark Comparison

For each cluster $c \in \{1, \dots, C\}$ of a given clustering scheme, define the equal-weight cluster portfolio return at month $t$:

$r_{c, t} = \frac{1}{|N_{c}|} \sum_{i \in N_{c}} r_{i, t}$

where $N_{c} = \{i : I(i) = c\}$. Annualized performance statistics (assuming 12 months per year) are:

$μ_{c}^{ann} = 12 \cdot \bar{r}_{c}, \qquad σ_{c}^{ann} = \sqrt{12} \cdot \mathrm{std}(r_{c}), \qquad \mathrm{Sharpe}_{c} = \frac{μ_{c}^{ann}}{σ_{c}^{ann}}$

A long-short spread portfolio goes long the highest-Sharpe cluster and short the lowest-Sharpe cluster:

$r_{LS, t} = r_{c^{*}, t} - r_{c_{*}, t}, \qquad c^{*} = \arg\max_{c} \mathrm{Sharpe}_{c}, \qquad c_{*} = \arg\min_{c} \mathrm{Sharpe}_{c}$

These metrics are reported in the statistical and K-means result sections above. All figures are in-sample descriptive statistics benchmarked against SPY; they do not incorporate transaction costs or shorting constraints.
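The portfolio statistics defined in this section can be sketched as follows. The function names and the tiny return matrix are illustrative; the use of the sample standard deviation (`ddof=1`) and the absence of a risk-free rate in the Sharpe ratio are assumptions consistent with the formulas above but not confirmed by the source.

```python
import numpy as np

def cluster_stats(returns, labels, periods_per_year=12):
    """returns: (T, n) simple monthly returns; labels: length-n cluster ids."""
    labels = np.asarray(labels)
    stats = {}
    for c in np.unique(labels):
        r_c = returns[:, labels == c].mean(axis=1)           # equal-weight cluster return
        mu = periods_per_year * r_c.mean()                   # annualized mean
        sigma = np.sqrt(periods_per_year) * r_c.std(ddof=1)  # annualized volatility
        stats[c.item()] = {
            "returns": r_c,
            "mu": mu,
            "sigma": sigma,
            "sharpe": mu / sigma,
            "wealth": np.prod(1.0 + r_c),                    # cumulative wealth W_{c,T}
        }
    return stats

def long_short(stats):
    """Spread between the highest- and lowest-Sharpe clusters."""
    best = max(stats, key=lambda c: stats[c]["sharpe"])
    worst = min(stats, key=lambda c: stats[c]["sharpe"])
    return best, worst, stats[best]["returns"] - stats[worst]["returns"]

# Three months, three assets; assets 0 and 1 form cluster 0, asset 2 forms cluster 1.
demo_returns = np.array([[0.02, 0.04, -0.01],
                         [0.00, 0.02, 0.03],
                         [0.04, 0.00, -0.02]])
stats = cluster_stats(demo_returns, [0, 0, 1])
best, worst, spread = long_short(stats)
```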

Data and Empirical Design

The investable universe is a sector-balanced sample of S&P 500 constituents: approximately five names per GICS sector, selected with a fixed random seed for reproducibility. Daily adjusted closes from Yahoo Finance (2015 onward) aggregate to monthly simple returns $r_{i, t}$ . SPY serves as the broad-market benchmark.
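The aggregation convention is not specified in the text. One common choice, sketched below as an assumption, is to take the last available adjusted close in each calendar month and compute simple returns between consecutive month-ends; the function name and the toy price series are illustrative.

```python
import numpy as np

def monthly_simple_returns(dates, prices):
    """dates: sorted array of numpy datetime64[D]; prices: (n_days, n_assets) adjusted closes.
    Returns (months, returns) where returns[t] is the simple return over month months[t]."""
    months = dates.astype("datetime64[M]")
    is_last = np.r_[months[1:] != months[:-1], True]   # last trading day of each month
    month_end = prices[is_last]
    return months[is_last][1:], month_end[1:] / month_end[:-1] - 1.0

demo_dates = np.array(
    ["2015-01-30", "2015-02-02", "2015-02-27", "2015-03-02", "2015-03-31"],
    dtype="datetime64[D]",
)
demo_prices = np.array([[100.0], [101.0], [110.0], [108.0], [121.0]])
months, r_monthly = monthly_simple_returns(demo_dates, demo_prices)
```

The first month in the sample is consumed as the starting price level, so a panel with $M$ calendar months yields $M - 1$ monthly returns.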

Three HPCA variants are estimated on the same return panel:

Sector HPCA: blocks defined by GICS sector labels (baseline, analogous to NAICS sectors in CRSP-based studies).
Statistical HPCA: blocks from sign-pattern clustering on PCA eigenvectors ( $K = 8$ components).
K-means HPCA: blocks from K-means on the first 4 PCA loadings ( $C = 8$ clusters).

For each variant, the pipeline computes $\tilde{R}$ , cluster portfolio returns, Sharpe ratios, long-short spreads, and eigenvalue spectra. The PCA-versus-HPCA comparison uses the sector-based $\tilde{R}$ against the classical empirical $R$ .

Limitations and Caveats

HPCA assumes uncorrelated residuals across blocks — a deliberate sparsity restriction that improves interpretability but may understate true cross-sector linkages during systemic crises when all correlations rise toward unity.

Estimation uses a single in-sample window; rolling or expanding re-estimation would better capture regime shifts but increases computational cost and introduces look-ahead considerations if not carefully implemented.

Yahoo Finance data and GICS labels differ from the CRSP/NAICS panel in the original Avellaneda & Serur (2020) study. Results illustrate methodology rather than replicate published CRSP numbers exactly.

Cluster Sharpe ratios and long-short spreads are in-sample descriptive statistics. They do not account for transaction costs, shorting constraints, or multiple-testing correction across cluster partitions.

This document is research output for education and quantitative discussion. It is not investment advice.

Conclusion

Hierarchical PCA offers a principled resolution to the PCA identification problem in large equity universes. By partitioning assets into economically or statistically meaningful blocks and imposing zero cross-block residual correlation, the model produces a valid, interpretable correlation matrix $\tilde{R}$ with an explicit tree structure.

Three lessons emerge from this empirical study. First, the eigenvalue spectrum of $\tilde{R}$ diverges measurably from classical $R$ , confirming that the block constraint is not a cosmetic relabelling — it reallocates variance across modes in a way that favours sector-factor interpretability. Second, sector factor loadings $β_{j}$ vary widely within blocks, reminding us that HPCA retains full intra-block empirical richness while only sparsifying cross-block entries. Third, dynamic clustering — whether by sign patterns or K-means on loadings — allows HPCA to adapt as correlation geometry shifts, extending the static sector framework of Avellaneda and Serur (2020) to time-varying market structure.

For portfolio managers, the framework supports risk decomposition (which sector factors drive co-movement?), correlation forecasting under structural priors, and cluster-based portfolio construction. The long-short spreads and Sharpe rankings reported here are descriptive summaries of in-sample cluster differentiation, not validated trading strategies. Future work should examine rolling re-estimation, out-of-sample correlation forecast accuracy, and integration with mean-variance or risk-parity optimisers under $\tilde{R}$ .

The summary statistics below synthesise the latest estimation run.

References

Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Modeling Asset Correlations. SSRN Working Paper 3903460. [https://ssrn.com/abstract=3903460](https://ssrn.com/abstract=3903460)

Avellaneda, M., & Serur, J. E. (2020). Hierarchical PCA and Applications to Portfolio Management. *Revista Mexicana de Economía y Finanzas*, 15(1), 1–18.

Avellaneda, M., & Lee, J.-H. (2010). Statistical arbitrage in the U.S. equities market. *Quantitative Finance*, 10(7), 761–782.

Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (2000). Random matrix theory and financial correlations. *International Journal of Theoretical and Applied Finance*, 3(3), 391–397.

Jolliffe, I. T. (2002). *Principal Component Analysis* (2nd ed.). Springer.
