Diversified Stock Portfolio Using Clustering Analysis
S&P 500 portfolio construction using K-means clustering on risk/return features.
Overview
This project constructs diversified stock portfolios from the S&P 500 using unsupervised learning. Historical price data is used to compute risk and return features for each stock, and K-means clustering groups stocks with similar behavior. Portfolio construction then selects the top stocks by Sharpe ratio from each cluster, so that holdings are spread across clusters. Performance is validated by backtesting against the S&P 500 index.
Data & Features
Data sources: S&P 500 constituent list and historical stock prices (US equity data fetched via the pipeline). The first 70% of the history is used for model building; the remaining 30% is reserved for validation.
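A chronological 70/30 split could look like the minimal sketch below. The function name `chronological_split` and the array layout (rows are trading days, columns are stocks) are assumptions made for illustration, not the project's actual pipeline code.

```python
import numpy as np

def chronological_split(prices, train_frac=0.7):
    """Split a (days x stocks) price array by time, without shuffling."""
    prices = np.asarray(prices, dtype=float)
    cut = int(len(prices) * train_frac)  # index of the first validation row
    return prices[:cut], prices[cut:]

# Toy data: 10 trading days for 2 hypothetical stocks.
toy_prices = np.arange(20, dtype=float).reshape(10, 2) + 100.0
train, valid = chronological_split(toy_prices)
print(train.shape, valid.shape)
```

Splitting by time rather than at random keeps every validation day strictly after every model-building day, so no future information leaks into the features.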
Features used for clustering (all derived from historical data):
Clustering Features
Correlation with S&P 500 index (price correlation).
Beta: sensitivity of stock returns to index returns.
Annualized return (from daily returns, 252-day year).
Annualized volatility (standard deviation of daily returns, annualized).
Sharpe ratio (annualized return / annualized volatility).
Daily change in price (open-to-close) and daily variation (high-to-low), annualized.
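The sketch below shows one way the return-based features above could be computed for a single stock. It assumes simple daily returns from closing prices and arithmetic annualization (mean daily return times 252); the source does not say which convention was used, and the function name and dictionary keys are hypothetical. The two intraday features are left out because the source does not specify how they are annualized.

```python
import numpy as np

TRADING_DAYS = 252  # 252-day year, as stated in the feature list

def risk_return_features(stock_close, index_close):
    """Features for one stock from two aligned 1-D arrays of closing prices."""
    stock_close = np.asarray(stock_close, dtype=float)
    index_close = np.asarray(index_close, dtype=float)

    # Simple daily returns: p_t / p_{t-1} - 1
    r_stock = stock_close[1:] / stock_close[:-1] - 1.0
    r_index = index_close[1:] / index_close[:-1] - 1.0

    ann_return = r_stock.mean() * TRADING_DAYS  # assumption: arithmetic scaling
    ann_vol = r_stock.std(ddof=1) * np.sqrt(TRADING_DAYS)
    cov = np.cov(r_stock, r_index, ddof=1)  # 2x2 covariance matrix of returns

    return {
        "cor": np.corrcoef(stock_close, index_close)[0, 1],  # price correlation
        "beta": cov[0, 1] / cov[1, 1],  # cov(stock, index) / var(index)
        "ann_return": ann_return,
        "ann_vol": ann_vol,
        "sharpe": ann_return / ann_vol,  # ratio as defined in the list above
    }

# Toy check: a hypothetical stock whose daily returns are twice the index's.
r = np.array([0.01, -0.02, 0.015, 0.005, -0.01])
index_close = 100.0 * np.cumprod(np.concatenate([[1.0], 1.0 + r]))
stock_close = 50.0 * np.cumprod(np.concatenate([[1.0], 1.0 + 2.0 * r]))
features = risk_return_features(stock_close, index_close)
print(features["beta"])
```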
Features are scaled (z-score) before clustering. The optimal number of clusters is chosen using the within-cluster sum of squares (elbow method); we use K = 4.
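Z-score scaling can be sketched as follows, assuming the features sit in a (stocks x features) NumPy array. The helper name `zscore` is hypothetical. Scaling matters here because K-means uses Euclidean distance, so an unscaled feature with a large numeric range would dominate the clustering.

```python
import numpy as np

def zscore(features):
    """Standardize each column to mean 0 and standard deviation 1."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant columns become all zeros instead of NaN
    return (features - mu) / sigma

# Toy matrix: 4 hypothetical stocks, columns on very different scales.
raw = np.array([[0.10, 1.2], [0.25, 0.8], [-0.05, 1.6], [0.30, 1.0]])
scaled = zscore(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```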
Clustering
K-means is run on the normalized feature matrix with multiple random starts. Cluster membership is then used during stock selection so that the portfolio spans different behavior groups rather than concentrating in one segment of the risk/return space.
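To illustrate what "multiple random starts" means, here is a plain NumPy sketch of Lloyd's algorithm that keeps the run with the lowest within-cluster sum of squares (WSS). It is not the project's implementation, which the source does not name; the function name, parameters, and toy data are assumptions, and a library K-means would normally be used in practice.

```python
import numpy as np

def kmeans(X, k, n_starts=10, n_iter=100, seed=0):
    """Lloyd's algorithm with several random starts; returns (labels, wss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_labels, best_wss = None, np.inf
    for start in range(n_starts):
        # Initialise centroids at k distinct, randomly chosen observations.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for step in range(n_iter):
            # Squared Euclidean distance from every point to every centroid.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Move each centroid to the mean of its points (keep it if empty).
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        wss = d2[np.arange(len(X)), labels].sum()
        if wss < best_wss:  # keep the best of all random starts
            best_labels, best_wss = labels, wss
    return best_labels, best_wss

# Toy data: two well-separated groups of 20 hypothetical stocks each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(10.0, 1.0, size=(20, 2))])
for k in (1, 2, 3, 4):
    _, wss = kmeans(X, k)
    print(k, round(float(wss), 1))
```

Running the function over a range of K values, as in the loop at the end, produces the WSS series used for an elbow plot.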
Portfolio Construction
Two portfolio variants are built:
(1) Diversified by cluster: Within each cluster, stocks are ranked by Sharpe ratio; the top 5 from each of the 4 clusters form a 20-stock portfolio (equal weight per stock).
(2) Top-20 by Sharpe: The top 20 stocks by Sharpe ratio across the full universe, without cluster constraints.
Both are equal-weighted. The diversified-by-cluster portfolio reduces concentration in a single risk/return profile.
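The two selection rules can be sketched in a few lines of plain Python. The record layout (`symbol`, `cluster`, `sharpe`), the function names, and the tickers below are hypothetical and only serve to show how the variants differ.

```python
def top_by_sharpe(stocks, n):
    """Return the n symbols with the highest Sharpe ratio."""
    ranked = sorted(stocks, key=lambda s: s["sharpe"], reverse=True)
    return [s["symbol"] for s in ranked[:n]]

def cluster_portfolio(stocks, per_cluster=5):
    """Pick the top `per_cluster` symbols by Sharpe ratio within each cluster."""
    by_cluster = {}
    for s in stocks:
        by_cluster.setdefault(s["cluster"], []).append(s)
    picks = []
    for cluster in sorted(by_cluster):
        picks.extend(top_by_sharpe(by_cluster[cluster], per_cluster))
    return picks

# Toy universe with made-up tickers: two clusters, two stocks each.
universe = [
    {"symbol": "AAA", "cluster": 1, "sharpe": 1.2},
    {"symbol": "BBB", "cluster": 1, "sharpe": 0.9},
    {"symbol": "CCC", "cluster": 2, "sharpe": 0.1},
    {"symbol": "DDD", "cluster": 2, "sharpe": -0.3},
]
by_cluster_picks = cluster_portfolio(universe, per_cluster=1)
overall_picks = top_by_sharpe(universe, 2)
weights = {sym: 1.0 / len(by_cluster_picks) for sym in by_cluster_picks}  # equal weight
print(by_cluster_picks, overall_picks, weights)
```

In the toy universe the overall ranking takes both stocks from cluster 1, while the per-cluster rule is forced to include the best stock from cluster 2 even though its Sharpe ratio is lower.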
Validation & Backtesting
Validation uses the holdout period (last 30% of the data). Daily returns are computed for the portfolio (equal-weighted average of constituent returns) and for the S&P 500 index.
Cumulative returns are plotted for the cluster-based portfolio, the top-20 Sharpe portfolio, and the S&P 500 to compare their out-of-sample performance.
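The return calculations described above can be sketched as follows, assuming a (days x stocks) array of holdout-period prices. The function names are hypothetical. Note that averaging constituent returns each day implicitly rebalances the portfolio to equal weights daily.

```python
import numpy as np

def equal_weight_returns(prices):
    """Daily portfolio return as the equal-weighted mean of constituent returns."""
    prices = np.asarray(prices, dtype=float)
    daily = prices[1:] / prices[:-1] - 1.0  # simple daily return per stock
    return daily.mean(axis=1)

def cumulative_return(daily_returns):
    """Compound daily returns into a cumulative return series."""
    return np.cumprod(1.0 + np.asarray(daily_returns, dtype=float)) - 1.0

# Toy holdout window: 3 days, 2 hypothetical stocks moving in opposite directions.
toy = np.array([[100.0, 100.0], [110.0, 90.0], [121.0, 81.0]])
portfolio_daily = equal_weight_returns(toy)
portfolio_cum = cumulative_return(portfolio_daily)
single_cum = cumulative_return([0.10, 0.10])
print(portfolio_cum, single_cum)
```

The same `cumulative_return` call applied to the index's daily returns gives the benchmark curve for the comparison plot.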
Results
Summary
Cluster statistics
| Cluster | Count | Mean return | Mean vol | Mean Sharpe | Mean beta |
|---|---|---|---|---|---|
| 1 | 40 | 28.60% | 30.70% | 0.94 | 0.99 |
| 2 | 91 | 7.20% | 25.40% | 0.30 | 0.81 |
| 3 | 20 | -4.20% | 53.70% | -0.07 | 1.69 |
| 4 | 47 | -3.80% | 30.20% | -0.12 | 0.79 |
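A table of this shape could be produced by grouping per-stock features by cluster label, as in the sketch below. The function name and the toy values are hypothetical and are not taken from the results above. The sketch averages per-stock Sharpe ratios, which appears to be how the Mean Sharpe column was computed, although the source does not state this.

```python
import numpy as np

def cluster_summary(labels, **columns):
    """Per-cluster count and column means, keyed by cluster label."""
    labels = np.asarray(labels)
    summary = {}
    for c in np.unique(labels):
        mask = labels == c
        row = {"count": int(mask.sum())}
        for name, values in columns.items():
            row["mean_" + name] = float(np.asarray(values, dtype=float)[mask].mean())
        summary[int(c)] = row
    return summary

# Toy input: 5 hypothetical stocks in 2 clusters.
toy_labels = [1, 1, 2, 2, 2]
toy_sharpe = [1.0, 0.8, 0.2, 0.4, 0.3]
toy_beta = [1.0, 1.2, 0.7, 0.9, 0.8]
stats = cluster_summary(toy_labels, sharpe=toy_sharpe, beta=toy_beta)
print(stats)
```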
Elbow plot: within-cluster sum of squares
What this shows: Within-cluster sum of squares (WSS) by candidate cluster count K.
How to read it: Look for the elbow where WSS improvement starts flattening; that point suggests a practical K before diminishing returns.
Return vs volatility by cluster
What this shows: Each point is a stock, plotted by annualized volatility (x-axis) and annualized return (y-axis), colored by assigned cluster.
How to read it: Tight, separated color groups indicate cleaner cluster structure; overlap suggests weaker separation in feature space.
Cluster-wise metrics
Mean annualized return, volatility, and Sharpe ratio by cluster.
Mean return
Mean volatility
Mean Sharpe ratio
Feature correlation matrix
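Pairwise correlation between the features used for K-means, shown to check for redundancy. For example, ann_vol correlates at 0.97 with the second ann_daily_… feature and at 0.82 with beta, and ann_return correlates at 0.94 with ann_sharpe….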
| | ann_return | ann_vol | ann_sharpe… | ann_daily_… | ann_daily_… | beta | cor |
|---|---|---|---|---|---|---|---|
| ann_return | 1.00 | -0.15 | 0.94 | -0.68 | -0.12 | -0.10 | 0.63 |
| ann_vol | -0.15 | 1.00 | -0.27 | 0.29 | 0.97 | 0.82 | -0.16 |
| ann_sharpe… | 0.94 | -0.27 | 1.00 | -0.66 | -0.23 | -0.18 | 0.66 |
| ann_daily_… | -0.68 | 0.29 | -0.66 | 1.00 | 0.32 | 0.15 | -0.58 |
| ann_daily_… | -0.12 | 0.97 | -0.23 | 0.32 | 1.00 | 0.79 | -0.16 |
| beta | -0.10 | 0.82 | -0.18 | 0.15 | 0.79 | 1.00 | 0.14 |
| cor | 0.63 | -0.16 | 0.66 | -0.58 | -0.16 | 0.14 | 1.00 |
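A redundancy check on such a matrix can be automated along the following lines. This is a sketch under assumptions: the function name, the 0.9 threshold, and the toy feature names `f1` to `f3` are all hypothetical.

```python
import numpy as np

def redundant_pairs(features, names, threshold=0.9):
    """List feature pairs whose absolute correlation meets the threshold."""
    # rowvar=False: columns are features, rows are stocks.
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs

# Toy matrix: f2 is exactly twice f1, f3 is only weakly related to either.
toy_features = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 4.0, -1.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, -1.0],
    [5.0, 10.0, 1.0],
    [6.0, 12.0, -1.0],
])
flagged = redundant_pairs(toy_features, ["f1", "f2", "f3"])
print(flagged)
```

Highly correlated features effectively count the same information twice in the Euclidean distance that K-means uses, which is why pairs flagged this way are candidates for dropping or combining.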
Portfolios
By cluster: top 5 by Sharpe in each of 4 clusters (equal weight).
Top 20: top 20 stocks by Sharpe ratio overall (equal weight).
Portfolio by cluster (symbols)
CEG, ANET, CTAS, AVGO, ACGL, BRK-B, CB, ED, EIX, ADP, FSLR, EQT, AMD, AMAT, CRWD, DVN, CF, CTRA, CTVA, AEE
Top 20 by Sharpe (symbols)
CEG, ANET, CTAS, AVGO, ACGL, COST, FICO, AFL, COR, AJG, ETN, CBOE, AZO, ABBV, XOM, APO, BLDR, BSX, BRO, ERIE
Validation: cumulative returns
What this shows: Out-of-sample cumulative return comparison of the cluster portfolio, top-20 Sharpe basket, and S&P 500 benchmark.
How to read it: Compare slope, drawdown phases, and final level to assess whether clustering adds robust value beyond simple ranking.