Nifty 50 Alpha101 Selection & Composite Factor Research
Summary
This study asks whether classic formulaic alphas — compact rules built from open, high, low, close, and volume — still separate winners from losers among Nifty 50 constituents over a long daily history of NSE-listed prices.
The workflow has three layers: (1) signal engineering on a daily panel of roughly fifty NSE stocks; (2) statistical evaluation via cross-sectional information coefficients and quintile spreads; (3) factor combination with both a transparent rolling linear model and an optional gradient-boosted machine-learning overlay.
Results are descriptive research outputs. They illustrate how multi-factor selection behaves on Indian large caps over the sample window; they are not live trading recommendations.
Research objectives
Measure predictive content of each alpha using daily Spearman IC and ICIR (mean IC divided by IC volatility).
Identify alphas stable enough to feed a composite score that ranks stocks each day.
Simulate long–short and quintile portfolios to see whether ranked signals translate into monotonic return patterns.
Compare a linear composite (rolling OLS with orthogonalised inputs) against a walk-forward LightGBM model trained on the same cleaned features.
Universe and market data
The investable set is the Nifty 50 index: liquid names traded on the National Stock Exchange of India. The sample focuses on current index members with sufficient listing history.
Daily adjusted open, high, low, close, and volume series start in 2015. Names with fewer than about five hundred trading days are dropped so that the rolling windows (20-day volatility, 60-day training) are well defined.
Market capitalisation at each rebalance supports size neutralisation. VWAP in volume-sensitive alphas is approximated as the typical price (average of high, low, and close) when exchange turnover detail is unavailable.
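The typical-price proxy and the minimum-history filter described above can be sketched as follows. This is an illustrative sketch, not the study's code: the function names, the wide `dates × stocks` pandas layout, and the toy tickers are assumptions.

```python
import numpy as np
import pandas as pd

def typical_price(high: pd.DataFrame, low: pd.DataFrame, close: pd.DataFrame) -> pd.DataFrame:
    """VWAP proxy: simple average of high, low and close (dates x stocks)."""
    return (high + low + close) / 3.0

def drop_short_histories(panel: pd.DataFrame, min_days: int = 500) -> pd.DataFrame:
    """Keep only stocks (columns) with at least `min_days` non-missing observations."""
    enough = panel.notna().sum(axis=0) >= min_days
    return panel.loc[:, enough]

# Toy data with hypothetical tickers; BBB has one missing day.
dates = pd.date_range("2024-01-01", periods=3, freq="B")
high = pd.DataFrame({"AAA": [11.0, 12.0, 13.0], "BBB": [np.nan, 21.0, 22.0]}, index=dates)
low = pd.DataFrame({"AAA": [9.0, 10.0, 11.0], "BBB": [np.nan, 19.0, 20.0]}, index=dates)
close = pd.DataFrame({"AAA": [10.0, 11.0, 12.0], "BBB": [np.nan, 20.0, 21.0]}, index=dates)

# Toy threshold of 3 days; the study uses about 500.
kept = drop_short_histories(close, min_days=3)
vwap_proxy = typical_price(high[kept.columns], low[kept.columns], kept)
```

With the toy threshold, `BBB` is dropped for insufficient history and the proxy is computed for `AAA` only.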
How the study is conducted
Step 1 — Panel construction. Prices are organised by trade date and stock. The forward one-day return is the percentage change from today’s close to the next session’s close; it is stored against today’s date as the prediction label, so features computed at today’s close never see the return they are asked to predict.
Step 2 — Alpha computation. Fifteen formulaic signals inspired by the WorldQuant Alpha101 library are evaluated in wide matrix form (dates × stocks), then stacked. Examples include short-horizon reversal, rank correlations of open and volume, VWAP distance, and close–open range ratios.
Step 3 — Cross-sectional cleaning. For each date independently: replace infinities; MAD winsorise (median ± 3× scaled MAD); z-score; regress each factor on log market cap and take residuals; z-score again; fill remaining gaps with zero.
Step 4 — Single-factor diagnostics. Every cleaned alpha is correlated with next-day returns across stocks (Spearman IC). Quintile portfolios sort stocks into five equal buckets; cumulative NAV paths and G5−G1 spreads are recorded.
Step 5 — Linear composite. Alphas passing the IC screens (or, if none pass, the top three by |ICIR| among those with |mean IC| ≥ 1%) enter a 60-day rolling OLS. Inputs are Löwdin-orthogonalised each day; the regression target is the cross-sectional percentile rank of forward return, so the model learns relative ordering rather than market beta.
Step 6 — Machine-learning overlay (optional). The same cleaned features feed a walk-forward LightGBM regressor with embargo between validation and test windows. Out-of-sample scores become an ML alpha; feature importance and mean |SHAP| values summarise drivers.
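As an illustration of the cross-sectional cleaning in Step 3, the sketch below processes one factor on one date. It is a minimal sketch under stated assumptions: the function names are hypothetical, the 1.4826 MAD scaling constant is the usual normal-consistency factor and is assumed here (the study only says "scaled MAD"), and the size regression includes an intercept.

```python
import numpy as np
import pandas as pd

def zscore(x: pd.Series) -> pd.Series:
    """Cross-sectional z-score; returns zeros if the dispersion is zero."""
    s = x.std(ddof=0)
    return (x - x.mean()) / s if s > 0 else x * 0.0

def clean_cross_section(factor: pd.Series, log_mcap: pd.Series, n_mad: float = 3.0) -> pd.Series:
    """Clean one factor for one date: inf -> NaN, MAD winsorise, z-score,
    size-neutralise, z-score again, fill gaps with zero."""
    log_mcap = log_mcap.reindex(factor.index)
    x = factor.astype(float).replace([np.inf, -np.inf], np.nan)

    # MAD winsorisation: clip to median +/- n_mad * scaled MAD (1.4826 is an assumption).
    med = x.median()
    mad = (x - med).abs().median() * 1.4826
    if mad > 0:
        x = x.clip(med - n_mad * mad, med + n_mad * mad)
    x = zscore(x)

    # Regress on log market cap (with intercept) and keep the residuals.
    valid = x.notna() & log_mcap.notna()
    if valid.sum() >= 3:  # otherwise leave the factor un-neutralised
        A = np.column_stack([np.ones(valid.sum()), log_mcap[valid].to_numpy()])
        beta, *_ = np.linalg.lstsq(A, x[valid].to_numpy(), rcond=None)
        resid = pd.Series(np.nan, index=x.index)
        resid[valid] = x[valid].to_numpy() - A @ beta
        x = resid

    return zscore(x).fillna(0.0)

# Toy cross-section of ten hypothetical stocks with one infinity and one outlier.
rng = np.random.default_rng(0)
names = [f"S{i}" for i in range(10)]
raw = pd.Series(rng.normal(size=10), index=names)
raw["S0"] = np.inf
raw["S1"] = 50.0
log_mcap = pd.Series(np.log(rng.uniform(1e3, 1e5, size=10)), index=names)
cleaned = clean_cross_section(raw, log_mcap)
```

In the full panel this function would be applied independently to each date and each alpha, matching the "for each date independently" wording of Step 3.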
How signals are evaluated
Information Coefficient (IC). Each trading day, factor values and realised next-day returns are ranked across stocks; IC is the Spearman correlation of those ranks. A positive IC means higher factor scores tended to precede higher returns.
ICIR. Mean daily IC divided by its standard deviation. Higher |ICIR| suggests a more stable signal; very low ICIR often means noise dominates.
Selection rule. Strict inclusion requires |mean IC| > 2% and |ICIR| > 0.30. On Nifty 50 these thresholds are demanding; when no factor qualifies, the composite uses the three alphas with the largest |ICIR| and |mean IC| ≥ 1%.
Layered backtest. Stocks are sorted into quintiles G1 (lowest factor) through G5 (highest). Each bucket is equal-weighted; the long–short leg is G5 minus G1. Reported metrics: cumulative return, annualised return and volatility, Sharpe ratio, and maximum drawdown — before brokerage, STT, or borrow costs.
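The IC and ICIR definitions above can be sketched in a few lines of pandas. This is an illustrative sketch, not the study's implementation: the function names, the `min_names` guard, and the toy reversal signal are assumptions.

```python
import numpy as np
import pandas as pd

def forward_returns(close: pd.DataFrame) -> pd.DataFrame:
    """Next-session close-to-close return, stored against today's date."""
    return close.shift(-1) / close - 1.0

def daily_spearman_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame, min_names: int = 5) -> pd.Series:
    """Per-date Spearman correlation between factor values and forward returns."""
    ics = {}
    for date in factor.index.intersection(fwd_ret.index):
        pair = pd.concat([factor.loc[date], fwd_ret.loc[date]], axis=1, keys=["f", "r"]).dropna()
        if len(pair) < min_names:
            continue  # too few stocks with both a score and a label
        ranks = pair.rank()
        ics[date] = ranks["f"].corr(ranks["r"])  # Pearson on ranks equals Spearman
    return pd.Series(ics, dtype=float)

def icir(ic: pd.Series) -> float:
    """Mean daily IC divided by its standard deviation."""
    return ic.mean() / ic.std()

# Toy panel: 30 business days, 8 hypothetical stocks.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=30)
names = [f"S{i}" for i in range(8)]
close = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, (30, 8)), axis=0), index=dates, columns=names
)
fwd = forward_returns(close)
signal = -(close / close.shift(5) - 1.0)  # toy 5-day reversal, for illustration only
ic = daily_spearman_ic(signal, fwd)
print(round(ic.mean(), 4), round(icir(ic), 3))
```

The last date has no forward return and the first five dates have no signal, so those dates drop out of the IC series automatically.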
Machine-learning extension
The gradient-boosted model uses L1 regression (robust to return outliers), shallow trees, column subsampling, and early stopping on a validation slice inside each walk-forward fold.
Training windows span roughly twelve months; validation three months; test three months; one-day embargo between validation end and test start to limit label leakage from overlapping forward returns.
The ML alpha is evaluated with the same IC and quintile machinery as single-factor alphas. Feature-importance bars show average gain across folds; SHAP summaries compress the beeswarm view into mean absolute contribution per feature on the last test sample.
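The walk-forward layout described above (about twelve months of training, three of validation, three of test, and a one-day embargo before the test window) might be generated as in the sketch below. The trading-day counts (252 and 63), the rolling rather than expanding training window, and the step equal to the test length are assumptions; the study does not state them.

```python
from typing import Iterator, NamedTuple

class Fold(NamedTuple):
    train: range
    valid: range
    test: range

def walk_forward_folds(
    n_days: int, train: int = 252, valid: int = 63, test: int = 63, embargo: int = 1
) -> Iterator[Fold]:
    """Yield positional index ranges for rolling train / validation / test windows.

    `embargo` days are skipped between the end of validation and the start of
    test so that one-day forward-return labels do not straddle the boundary.
    """
    start = 0
    while True:
        train_end = start + train
        valid_end = train_end + valid
        test_start = valid_end + embargo
        test_end = test_start + test
        if test_end > n_days:
            break
        yield Fold(range(start, train_end), range(train_end, valid_end), range(test_start, test_end))
        start += test  # roll forward by one test window

folds = list(walk_forward_folds(n_days=500))
for k, fold in enumerate(folds):
    print(k, fold.train.start, fold.valid.start, fold.test.start, fold.test.stop)
```

Each fold's validation slice would be used for early stopping and its test slice for out-of-sample scores; stepping by the test length keeps consecutive test windows contiguous and non-overlapping.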
Limitations and interpretation
A cross-section of about fifty stocks is narrow compared with CSI 300 or the full NSE universe; IC estimates are noisier and each quintile bucket holds only about ten names.
Published closing prices may differ slightly from exchange official figures; corporate actions are handled via standard price adjustment.
Industry neutralisation is not applied — only log market cap — because sector classifications are not part of this study.
Backtests are paper portfolios without Indian transaction taxes, stamp duty, or short borrow; live implementation would reduce net spreads.
In-sample factor mining across fifteen correlated alphas increases multiple-testing risk; treat significance as exploratory.
Interactive results (below)
The screening table lists IC mean, IC standard deviation, ICIR, and composite membership for every alpha.
Use the factor tabs to inspect daily IC (bars) and cumulative IC (line), quintile NAV curves, and the full performance table for any signal.
The synthetic factor section shows the rolling OLS composite; the ML alpha section appears when the boost model ran successfully in the latest build.
Empirical results
Interactive diagnostics from the latest study. Time series are downsampled for display; performance tables use the full sample.
Study snapshot
| Item | Value |
|---|---|
| Universe | Nifty 50 (49 stocks with sufficient history) |
| Sample period | 2015-01-01 to 2026-05-25 |
| Alphas computed | 15 |
| OLS composite inputs | alpha038, alpha101, alpha_5_day_reversal (fallback: top three by absolute ICIR) |
| Market | India (NSE-listed equities) |
| ML walk-forward | 39 out-of-sample folds |
Factor screening
Cross-sectional Spearman IC diagnostics for each cleaned alpha. Composite membership follows strict IC/ICIR thresholds or a ranked fallback when no factor clears the bar.
| Factor | IC mean | IC std | ICIR | In composite |
|---|---|---|---|---|
| alpha038 | 0.0340 | 0.1890 | 0.178 | yes |
| alpha101 | -0.0280 | 0.1800 | -0.157 | yes |
| alpha_5_day_reversal | -0.0270 | 0.2000 | -0.134 | yes |
| alpha041 | 0.0220 | 0.1700 | 0.127 | — |
| alpha001 | -0.0210 | 0.1690 | -0.121 | — |
| alpha006 | 0.0160 | 0.1590 | 0.100 | — |
| alpha042 | 0.0190 | 0.1940 | 0.095 | — |
| alpha040 | 0.0140 | 0.1670 | 0.082 | — |
| alpha094 | 0.0140 | 0.1790 | 0.080 | — |
| alpha002 | 0.0100 | 0.1550 | 0.064 | — |
| alpha012 | 0.0090 | 0.1550 | 0.059 | — |
| alpha072 | 0.0080 | 0.1530 | 0.052 | — |
| alpha088 | 0.0060 | 0.1500 | 0.039 | — |
| alpha003 | 0.0030 | 0.1490 | 0.019 | — |
| alpha098 | -0.0010 | 0.1500 | -0.008 | — |
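To make the composite-membership column concrete, the sketch below applies the stated selection rule (strict thresholds of |mean IC| > 0.02 and |ICIR| > 0.30, with a fallback to the top three by |ICIR| among factors with |mean IC| ≥ 0.01) to the IC mean and ICIR values in the table above. The function name and data layout are hypothetical.

```python
# (IC mean, ICIR) per factor, copied from the screening table.
screen = {
    "alpha038": (0.0340, 0.178),
    "alpha101": (-0.0280, -0.157),
    "alpha_5_day_reversal": (-0.0270, -0.134),
    "alpha041": (0.0220, 0.127),
    "alpha001": (-0.0210, -0.121),
    "alpha006": (0.0160, 0.100),
    "alpha042": (0.0190, 0.095),
    "alpha040": (0.0140, 0.082),
    "alpha094": (0.0140, 0.080),
    "alpha002": (0.0100, 0.064),
    "alpha012": (0.0090, 0.059),
    "alpha072": (0.0080, 0.052),
    "alpha088": (0.0060, 0.039),
    "alpha003": (0.0030, 0.019),
    "alpha098": (-0.0010, -0.008),
}

def select_factors(stats, ic_min=0.02, icir_min=0.30, fallback_ic_min=0.01, fallback_n=3):
    """Return (selected factor names, mode) under the strict rule or the ranked fallback."""
    strict = [n for n, (ic, ir) in stats.items() if abs(ic) > ic_min and abs(ir) > icir_min]
    if strict:
        return strict, "strict"
    pool = [n for n, (ic, _) in stats.items() if abs(ic) >= fallback_ic_min]
    pool.sort(key=lambda n: abs(stats[n][1]), reverse=True)
    return pool[:fallback_n], "fallback"

selected, mode = select_factors(screen)
print(mode, selected)
# fallback ['alpha038', 'alpha101', 'alpha_5_day_reversal']
```

No factor clears the strict |ICIR| bar (the largest is 0.178), so the fallback applies and reproduces the three factors marked "yes" in the table.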
Single-factor diagnostics
Select an alpha to view its information coefficient path and quintile backtest. G1 is the lowest factor bucket; G5 the highest; L-S is the top-minus-bottom spread.
Information coefficient (selected alpha)
Quintile cumulative NAV (selected alpha)
Rolling linear composite
Multi-alpha score from 60-day rolling OLS on orthogonalised inputs. Training labels are cross-sectional ranks of next-day return.
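For readers unfamiliar with Löwdin (symmetric) orthogonalisation, the sketch below shows one common way to compute it for a single date's factor matrix. It is a sketch under assumptions: the function name is hypothetical, inputs are taken to be centred columns, and the output is scaled so each orthogonalised factor has unit variance, which the study does not specify.

```python
import numpy as np

def lowdin_orthogonalise(F: np.ndarray) -> np.ndarray:
    """Symmetric orthogonalisation of the columns of F (n_stocks x k).

    Computes F @ M^{-1/2}, where M = F'F / n is the factor overlap matrix.
    Among all orthogonal bases, this one stays closest to the original
    columns in the least-squares sense and treats every factor symmetrically.
    """
    n = F.shape[0]
    M = F.T @ F / n
    eigval, eigvec = np.linalg.eigh(M)    # M is symmetric positive semi-definite
    eigval = np.clip(eigval, 1e-12, None)  # guard against near-singular overlap
    M_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return F @ M_inv_sqrt

# Toy cross-section: 49 stocks, 3 correlated factors, z-scored by column.
rng = np.random.default_rng(0)
base = rng.normal(size=(49, 3))
base[:, 1] += 0.8 * base[:, 0]
F = (base - base.mean(axis=0)) / base.std(axis=0)
G = lowdin_orthogonalise(F)
print(np.round(G.T @ G / 49, 6))  # approximately the identity matrix
```

Unlike a sequential Gram-Schmidt scheme, the result does not depend on the order in which the factors are listed, which is one reason the symmetric form suits a composite where no input has priority.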
Information coefficient — composite
Quintile cumulative NAV — composite
| Bucket | Cum Return | Ann Return | Ann Vol | Sharpe | Max DD |
|---|---|---|---|---|---|
| G1 | 124.50% | 7.70% | 19.00% | 0.063 | -49.70% |
| G2 | 252.00% | 12.20% | 17.90% | 0.320 | -47.60% |
| G3 | 1111.50% | 25.70% | 17.90% | 1.073 | -37.40% |
| G4 | 1285.30% | 27.20% | 17.90% | 1.156 | -35.80% |
| G5 | 889.10% | 23.40% | 18.90% | 0.893 | -36.50% |
| L-S | 289.90% | 13.30% | 14.70% | 0.461 | -32.00% |
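The bucket returns and summary metrics in a table like the one above could be produced along the lines of the sketch below. The function names, the 252-day geometric annualisation, and the `rf_annual` parameter are assumptions. The study does not state a risk-free rate, although the reported Sharpe ratios are arithmetically consistent with subtracting roughly 6.5% a year (for example, (7.70% − 6.50%) / 19.00% ≈ 0.063 for G1).

```python
import numpy as np
import pandas as pd

def quintile_returns(score: pd.DataFrame, fwd_ret: pd.DataFrame, n_buckets: int = 5) -> pd.DataFrame:
    """Equal-weighted next-day return of each score bucket, plus the top-minus-bottom spread."""
    rows = {}
    for date in score.index.intersection(fwd_ret.index):
        pair = pd.concat([score.loc[date], fwd_ret.loc[date]], axis=1, keys=["s", "r"]).dropna()
        if len(pair) < n_buckets:
            continue
        # Rank first so ties cannot produce duplicate bucket edges.
        bucket = pd.qcut(pair["s"].rank(method="first"), n_buckets, labels=False) + 1
        rows[date] = pair["r"].groupby(bucket).mean()
    out = pd.DataFrame(rows).T
    out.columns = [f"G{int(c)}" for c in out.columns]
    out["L-S"] = out[f"G{n_buckets}"] - out["G1"]
    return out

def performance(daily: pd.Series, periods: int = 252, rf_annual: float = 0.0) -> dict:
    """Cumulative return, annualised return and volatility, Sharpe ratio, maximum drawdown."""
    nav = (1.0 + daily).cumprod()
    ann_ret = nav.iloc[-1] ** (periods / len(daily)) - 1.0
    ann_vol = daily.std() * np.sqrt(periods)
    peak = nav.cummax().clip(lower=1.0)  # treat the starting NAV of 1 as a peak
    return {
        "cum_return": nav.iloc[-1] - 1.0,
        "ann_return": ann_ret,
        "ann_vol": ann_vol,
        "sharpe": (ann_ret - rf_annual) / ann_vol,
        "max_dd": (nav / peak - 1.0).min(),
    }

# Toy check with a perfect-foresight score (illustration only, not a result).
rng = np.random.default_rng(1)
dates = pd.bdate_range("2024-01-01", periods=20)
names = [f"S{i}" for i in range(10)]
fwd = pd.DataFrame(rng.normal(0, 0.01, (20, 10)), index=dates, columns=names)
buckets = quintile_returns(fwd, fwd)
stats = performance(buckets["L-S"])
```

All figures produced this way are gross of costs, in line with the "before brokerage, STT, or borrow costs" caveat in the evaluation section.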
Machine-learning alpha
LightGBM walk-forward model trained on the same cleaned alphas. Out-of-sample predictions are ranked and backtested like single-factor signals.
IC mean 0.0240 · ICIR 0.143 · 39 folds
Information coefficient — ML alpha
Quintile cumulative NAV — ML alpha
| Bucket | Cum Return | Ann Return | Ann Vol | Sharpe | Max DD |
|---|---|---|---|---|---|
| G1 | 221.40% | 12.70% | 18.00% | 0.346 | -42.10% |
| G2 | 291.60% | 15.00% | 18.20% | 0.469 | -38.90% |
| G3 | 573.60% | 21.60% | 17.70% | 0.854 | -37.20% |
| G4 | 930.40% | 27.00% | 18.10% | 1.137 | -38.00% |
| G5 | 874.30% | 26.30% | 18.90% | 1.046 | -44.00% |
| L-S | 182.10% | 11.20% | 13.70% | 0.344 | -22.20% |
Feature importance (mean gain across folds)
SHAP contribution (mean absolute value)