Nifty 50 Alpha101 Selection & Composite Factor Research

Indian large caps · Price-volume alphas · Cross-sectional cleaning · Information coefficient screening · Linear factor combination · Machine-learning composite · Quintile portfolio backtests · Nifty 50 constituents

Summary

This study asks whether classic formulaic alphas — compact rules built from open, high, low, close, and volume — still separate winners from losers among Nifty 50 constituents over a long daily history of NSE-listed prices.

The workflow has three layers: (1) signal engineering on a daily panel of roughly fifty NSE stocks; (2) statistical evaluation via cross-sectional information coefficients and quintile spreads; (3) factor combination with both a transparent rolling linear model and an optional gradient-boosted machine-learning overlay.

Results are descriptive research outputs. They illustrate how multi-factor selection behaves on Indian large caps over the sample window; they are not live trading recommendations.

Research objectives

Measure predictive content of each alpha using daily Spearman IC and ICIR (mean IC divided by IC volatility).

Identify alphas stable enough to feed a composite score that ranks stocks each day.

Simulate long–short and quintile portfolios to see whether ranked signals translate into monotonic return patterns.

Compare a linear composite (rolling OLS with orthogonalized inputs) against a walk-forward LightGBM model trained on the same cleaned features.

Universe and market data

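The investable set is the Nifty 50 index: liquid names traded on the National Stock Exchange of India. The sample focuses on current index members with sufficient listing history, which introduces survivorship bias because stocks that were dropped from the index in earlier years are excluded.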

Daily adjusted open, high, low, close, and volume data start in 2015. Names with fewer than about five hundred trading days are dropped so that the rolling windows (20-day volatility, 60-day training) are well defined.

Market capitalisation at each rebalance supports size neutralisation. VWAP in volume-sensitive alphas is approximated as the typical price (average of high, low, and close) when exchange turnover detail is unavailable.

How the study is conducted

Step 1 — Panel construction. Prices are organised by trade date and stock. The forward one-day return is the percentage change from today's close to the next session's close. It is stamped on today's row, so a signal computed from data up to today's close is always paired with a return realised afterwards (no look-ahead in labels).
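The label alignment in Step 1 can be sketched as follows. This is a minimal illustration assuming a wide pandas frame of adjusted closes (dates × stocks); the function name and the toy tickers are hypothetical, not taken from the study's code.

```python
import pandas as pd

def forward_returns(close: pd.DataFrame) -> pd.DataFrame:
    """Next-session close-to-close return, stamped on today's row.

    `close` is a wide frame (dates x stocks) of adjusted closes.
    The last row is NaN because no next session exists yet.
    """
    return close.shift(-1) / close - 1.0

# Toy panel with hypothetical tickers.
close = pd.DataFrame(
    {"AAA": [100.0, 110.0, 99.0], "BBB": [50.0, 50.0, 55.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
fwd = forward_returns(close)
```

Because the shift is negative, the row for 2020-01-01 holds the return earned between the 2020-01-01 close and the 2020-01-02 close, which is the quantity a signal known at the first close is asked to predict.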

Step 2 — Alpha computation. Fifteen formulaic signals inspired by the WorldQuant Alpha101 library are evaluated in wide matrix form (dates × stocks), then stacked. Examples include short-horizon reversal, rank correlations of open and volume, VWAP distance, and close–open range ratios.
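As an illustration of the wide matrix form, the sketch below computes two simple signals. The `alpha101` expression follows the published Alpha#101 formula; the exact definition behind `alpha_5_day_reversal` is not given on this page, so the negative five-day return used here is an assumption. The typical-price proxy for VWAP is the one described under "Universe and market data". All function names are illustrative.

```python
import pandas as pd

def typical_price(high: pd.DataFrame, low: pd.DataFrame, close: pd.DataFrame) -> pd.DataFrame:
    """VWAP proxy: average of high, low, and close."""
    return (high + low + close) / 3.0

def alpha101(open_: pd.DataFrame, high: pd.DataFrame,
             low: pd.DataFrame, close: pd.DataFrame) -> pd.DataFrame:
    """Alpha#101: intraday move scaled by the day's range."""
    return (close - open_) / ((high - low) + 0.001)

def five_day_reversal(close: pd.DataFrame) -> pd.DataFrame:
    """Assumed definition: negative of the trailing five-day return."""
    return -(close / close.shift(5) - 1.0)

# Toy wide panel (dates x stocks) with one hypothetical ticker.
idx = pd.date_range("2020-01-01", periods=7, freq="B")
close = pd.DataFrame({"AAA": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 110.0]}, index=idx)
open_, high, low = close - 1.0, close + 1.0, close - 2.0

a101 = alpha101(open_, high, low, close)
rev5 = five_day_reversal(close)
vwap = typical_price(high, low, close)
```

Every function takes and returns frames with the same dates × stocks shape, so the outputs can be stacked into a long panel in one step once all signals are computed.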

Step 3 — Cross-sectional cleaning. For each date independently: replace infinities; MAD winsorise (median ± 3× scaled MAD); z-score; regress each factor on log market cap and take residuals; z-score again; fill remaining gaps with zero.
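The cleaning chain for a single date might look like the sketch below. It assumes the common 1.4826 consistency constant for the scaled MAD and a population (ddof=0) z-score; neither choice is confirmed by the page, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd

def zscore(x: pd.Series) -> pd.Series:
    """Cross-sectional z-score; returns zeros when dispersion is zero."""
    sd = x.std(ddof=0)
    return (x - x.mean()) / sd if sd > 0 else x * 0.0

def clean_one_date(factor: pd.Series, log_mcap: pd.Series, n_mad: float = 3.0) -> pd.Series:
    """Clean one date's factor values across stocks."""
    # 1) Replace infinities with NaN.
    x = factor.replace([np.inf, -np.inf], np.nan)
    # 2) MAD winsorise at median +/- n_mad * scaled MAD (1.4826 is an assumed scale).
    med = x.median()
    mad = 1.4826 * (x - med).abs().median()
    x = x.clip(med - n_mad * mad, med + n_mad * mad)
    # 3) First z-score.
    x = zscore(x)
    # 4) Size neutralisation: residual of the factor on [1, log market cap].
    log_mcap = log_mcap.reindex(x.index)
    ok = x.notna() & log_mcap.notna()
    X = np.column_stack([np.ones(int(ok.sum())), log_mcap[ok].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, x[ok].to_numpy(), rcond=None)
    resid = pd.Series(np.nan, index=x.index, dtype=float)
    resid.loc[ok] = x[ok].to_numpy() - X @ beta
    # 5) Second z-score, then fill remaining gaps with zero.
    return zscore(resid).fillna(0.0)

# Synthetic cross-section of 50 hypothetical names with a size tilt,
# one infinity, and one extreme outlier.
rng = np.random.default_rng(0)
names = [f"S{i}" for i in range(50)]
log_mcap = pd.Series(rng.normal(10.0, 1.0, 50), index=names)
raw = pd.Series(0.5 * log_mcap.to_numpy() + rng.normal(0.0, 1.0, 50), index=names)
raw.iloc[0] = np.inf
raw.iloc[1] = 1e6
cleaned = clean_one_date(raw, log_mcap)
```

Filling gaps with zero after the final z-score places a stock with missing data at the cross-sectional mean, so it does not drive the ranking in either direction.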

Step 4 — Single-factor diagnostics. Every cleaned alpha is correlated with next-day returns across stocks (Spearman IC). Quintile portfolios sort stocks into five equal buckets; cumulative NAV paths and G5−G1 spreads are recorded.

Step 5 — Linear composite. Alphas passing IC screens (or, if none pass, the top three by |ICIR|) enter a 60-day rolling OLS. Inputs are Löwdin-orthogonalised each day; the regression target is the cross-sectional percentile rank of forward return so the model learns relative ordering, not market beta.
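Löwdin (symmetric) orthogonalisation can be sketched with an eigendecomposition of the factor Gram matrix. This is a generic textbook version that assumes one date's cleaned factors are held in an N × K NumPy array; the study's own implementation and any rescaling it applies are not shown on this page.

```python
import numpy as np

def lowdin_orthogonalise(F: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return F @ M^(-1/2), where M = F.T @ F.

    F has one row per stock and one column per factor. The output columns
    are orthonormal, and no factor is privileged by its column order.
    """
    M = F.T @ F
    eigval, eigvec = np.linalg.eigh(M)       # M is symmetric, so eigh applies
    eigval = np.clip(eigval, eps, None)      # guard against near-singular M
    M_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return F @ M_inv_sqrt

# Three deliberately correlated factors on 50 hypothetical stocks.
rng = np.random.default_rng(1)
F = rng.normal(size=(50, 3))
F[:, 1] += 0.8 * F[:, 0]
G = lowdin_orthogonalise(F)
gram = G.T @ G   # close to the 3 x 3 identity
```

Unlike Gram–Schmidt, the symmetric transform does not depend on the order in which factors are listed, which matters when no alpha has a natural claim to be the "first" one. The rank target described above could be built with `fwd.rank(axis=1, pct=True)` on a wide forward-return frame.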

Step 6 — Machine-learning overlay (optional). The same cleaned features feed a walk-forward LightGBM regressor with embargo between validation and test windows. Out-of-sample scores become an ML alpha; feature importance and mean |SHAP| values summarise drivers.

How signals are evaluated

Information Coefficient (IC). Each trading day, factor values and realised next-day returns are ranked across stocks; IC is the Spearman correlation of those ranks. A positive IC means higher factor scores tended to precede higher returns.

ICIR. Mean daily IC divided by its standard deviation. Higher |ICIR| suggests a more stable signal; very low ICIR often means noise dominates.
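The two statistics can be computed along the following lines. The sketch assumes wide frames (dates × stocks) for the factor and the forward return and uses the sample standard deviation for ICIR; function names are illustrative.

```python
import numpy as np
import pandas as pd

def daily_spearman_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Spearman IC per date: Pearson correlation of cross-sectional ranks."""
    both = factor.notna() & fwd_ret.notna()       # rank only jointly observed names
    factor_rank = factor.where(both).rank(axis=1)
    return_rank = fwd_ret.where(both).rank(axis=1)
    return factor_rank.corrwith(return_rank, axis=1)

def icir(ic: pd.Series) -> float:
    """Mean daily IC divided by its standard deviation."""
    ic = ic.dropna()
    return float(ic.mean() / ic.std())

# Sanity check on synthetic data: a factor that orders returns perfectly.
rng = np.random.default_rng(2)
idx = pd.date_range("2020-01-01", periods=5, freq="B")
cols = [f"S{i}" for i in range(10)]
factor = pd.DataFrame(rng.normal(size=(5, 10)), index=idx, columns=cols)
ic_perfect = daily_spearman_ic(factor, 2.0 * factor)
ic_inverse = daily_spearman_ic(factor, -factor)
ratio = icir(pd.Series([0.1, 0.3]))
```

Applied to the screening table, the same ratio gives 0.0330 / 0.1890 ≈ 0.175 for alpha038, in line with the reported 0.176 once rounding of the displayed inputs is allowed for.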

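Layered backtest. Stocks are sorted into quintiles G1 (lowest factor) through G5 (highest). Each bucket is equal-weighted; the long–short leg is G5 minus G1. Reported metrics are cumulative return, annualised return and volatility, Sharpe ratio, and maximum drawdown, all before brokerage, STT, or borrow costs. The Sharpe figures in the result tables are consistent with subtracting an annual risk-free rate of roughly 6.5% from the annualised return, although the page does not state the rate used.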
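A minimal sketch of the layered backtest follows. It assumes wide factor and forward-return frames, breaks ties by column order, annualises with 252 sessions, and leaves the risk-free rate as a parameter because the page does not state one. Function names are hypothetical.

```python
import numpy as np
import pandas as pd

def quintile_returns(factor: pd.DataFrame, fwd_ret: pd.DataFrame, n_buckets: int = 5) -> pd.DataFrame:
    """Equal-weighted daily return of G1..Gn plus the long-short leg."""
    rank = factor.rank(axis=1, method="first")           # 1..n per date
    n = rank.notna().sum(axis=1)
    bucket = np.ceil(rank.mul(n_buckets).div(n, axis=0))  # 1 = lowest factor
    out = {f"G{g}": fwd_ret.where(bucket == g).mean(axis=1)
           for g in range(1, n_buckets + 1)}
    res = pd.DataFrame(out)
    res["L-S"] = res[f"G{n_buckets}"] - res["G1"]
    return res

def summarise(daily: pd.Series, periods: int = 252, rf: float = 0.0) -> dict:
    """Cumulative/annualised return, volatility, Sharpe, and max drawdown."""
    daily = daily.dropna()
    nav = (1.0 + daily).cumprod()
    ann_ret = nav.iloc[-1] ** (periods / len(daily)) - 1.0
    ann_vol = daily.std() * np.sqrt(periods)
    return {
        "cum_return": nav.iloc[-1] - 1.0,
        "ann_return": ann_ret,
        "ann_vol": ann_vol,
        "sharpe": (ann_ret - rf) / ann_vol,   # rf is an assumed input
        "max_dd": (nav / nav.cummax() - 1.0).min(),
    }

# Toy data: ten hypothetical stocks whose return rises with the factor.
idx = pd.date_range("2020-01-01", periods=3, freq="B")
cols = [f"S{i}" for i in range(1, 11)]
factor = pd.DataFrame([list(range(1, 11))] * 3, index=idx, columns=cols, dtype=float)
fwd = factor * 0.001
q = quintile_returns(factor, fwd)
stats = summarise(pd.Series([0.1, -0.5, 0.2]))
```

With fifty names each bucket holds ten stocks, so a single extreme return moves a bucket's daily average by a tenth of its size; this is one reason the quintile curves on a narrow universe are noisy.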

Machine-learning extension

The gradient-boosted model uses L1 regression (robust to return outliers), shallow trees, column subsampling, and early stopping on a validation slice inside each walk-forward fold.
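The ingredients named above map onto LightGBM's scikit-learn interface roughly as sketched below. The specific values (depth, leaf count, subsampling rate, learning rate, stopping patience) are placeholders, not the study's settings, and `fit_fold` is a hypothetical helper; `lightgbm` is imported lazily so the parameter block can be read without the library installed.

```python
# Placeholder hyperparameters; the study's actual values are not published here.
LGBM_PARAMS = {
    "objective": "regression_l1",  # L1 (absolute-error) loss, robust to outliers
    "max_depth": 3,                # shallow trees
    "num_leaves": 8,               # at most 2 ** max_depth leaves
    "colsample_bytree": 0.7,       # column subsampling per tree
    "learning_rate": 0.05,
    "n_estimators": 500,           # upper bound; early stopping picks fewer
}

def fit_fold(X_train, y_train, X_valid, y_valid, stopping_rounds: int = 50):
    """Fit one walk-forward fold with early stopping on the validation slice."""
    import lightgbm as lgb  # lazy import: only needed when a fold is fitted

    model = lgb.LGBMRegressor(**LGBM_PARAMS)
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_valid, y_valid)],
        callbacks=[lgb.early_stopping(stopping_rounds, verbose=False)],
    )
    return model
```

Stopping on the validation slice rather than the test slice keeps the test window untouched by any tuning decision, which is what makes the per-fold scores out-of-sample.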

Each fold trains on roughly twelve months, validates on the following three months, and tests on the three months after that. A one-day embargo between the end of validation and the start of testing limits label leakage from overlapping forward returns.
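The fold layout can be sketched as a generator over trading dates. The window lengths of 252 and 63 sessions are assumed stand-ins for "roughly twelve months" and "three months", and stepping forward by one test window per fold is also an assumption; the names are illustrative.

```python
from typing import Iterator, NamedTuple

import pandas as pd

class Fold(NamedTuple):
    train: pd.DatetimeIndex
    valid: pd.DatetimeIndex
    test: pd.DatetimeIndex

def walk_forward_folds(dates, train_days: int = 252, valid_days: int = 63,
                       test_days: int = 63, embargo_days: int = 1) -> Iterator[Fold]:
    """Yield train/validation/test date blocks, rolling forward by one test window."""
    dates = pd.DatetimeIndex(dates).sort_values()
    start = 0
    while True:
        train_end = start + train_days
        valid_end = train_end + valid_days
        test_start = valid_end + embargo_days   # skip the embargoed sessions
        test_end = test_start + test_days
        if test_end > len(dates):
            break
        yield Fold(dates[start:train_end],
                   dates[train_end:valid_end],
                   dates[test_start:test_end])
        start += test_days

# 1000 business days as a stand-in for the trading calendar.
dates = pd.bdate_range("2015-01-01", periods=1000)
folds = list(walk_forward_folds(dates))
gap = dates.get_loc(folds[0].test[0]) - dates.get_loc(folds[0].valid[-1])
```

Here `gap` equals 2, meaning exactly one session sits between the last validation date and the first test date. With a one-day forward return, the label on the last validation date depends on the next session's close, so that session is the one excluded from testing.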

The ML alpha is evaluated with the same IC and quintile machinery as single-factor alphas. Feature-importance bars show average gain across folds; SHAP summaries compress the beeswarm view into mean absolute contribution per feature on the last test sample.

Limitations and interpretation

Fifty stocks is a narrow cross-section compared with broader universes such as the CSI 300 or the full NSE list, so IC estimates are noisier and each quintile bucket holds only about ten names.

Published closing prices may differ slightly from exchange official figures; corporate actions are handled via standard price adjustment.

Industry neutralisation is not applied — only log market cap — because sector classifications are not part of this study.

Backtests are paper portfolios without Indian transaction taxes, stamp duty, or short borrow; live implementation would reduce net spreads.

In-sample factor mining across fifteen correlated alphas increases multiple-testing risk; treat significance as exploratory.

Interactive results (below)

The screening table lists IC mean, IC standard deviation, ICIR, and composite membership for every alpha.

Use the factor tabs to inspect daily IC (bars) and cumulative IC (line), quintile NAV curves, and the full performance table for any signal.

The synthetic factor section shows the rolling OLS composite; the ML alpha section appears only when the gradient-boosted model ran successfully in the latest build.

Universe	Nifty 50 — 50 stocks with sufficient history
Sample period	2015-01-01 to 2026-07-17
Alphas computed	15
OLS composite inputs	alpha038, alpha101, alpha_5_day_reversal (fallback: top three by |ICIR|)
Market	India — NSE-listed equities
ML walk-forward	40 out-of-sample folds

Factor	IC mean	IC std	ICIR	In composite
alpha038	0.0330	0.1890	0.176	yes
alpha101	-0.0270	0.1800	-0.153	yes
alpha_5_day_reversal	-0.0260	0.2000	-0.132	yes
alpha041	0.0220	0.1690	0.127	—
alpha001	-0.0200	0.1680	-0.117	—
alpha006	0.0160	0.1580	0.099	—
alpha042	0.0180	0.1940	0.091	—
alpha040	0.0130	0.1660	0.078	—
alpha094	0.0130	0.1780	0.075	—
alpha002	0.0110	0.1530	0.069	—
alpha072	0.0090	0.1510	0.058	—
alpha012	0.0090	0.1540	0.055	—
alpha088	0.0070	0.1490	0.048	—
alpha003	0.0010	0.1490	0.003	—
alpha098	0.0000	0.1480	0.000	—

Information coefficient — composite

Quintile cumulative NAV — composite

Bucket	Cum Return	Ann Return	Ann Vol	Sharpe	Max DD
G1	119.70%	7.40%	19.50%	0.044	-54.40%
G2	236.00%	11.60%	18.00%	0.282	-45.20%
G3	799.20%	21.90%	18.10%	0.853	-42.90%
G4	1161.90%	25.70%	18.10%	1.063	-38.00%
G5	1118.40%	25.30%	18.70%	1.008	-34.10%
L-S	380.90%	15.20%	14.90%	0.586	-34.50%

ML alpha: IC mean 0.0250 · ICIR 0.150 · 40 out-of-sample folds

Information coefficient — ML alpha

Quintile cumulative NAV — ML alpha

Bucket	Cum Return	Ann Return	Ann Vol	Sharpe	Max DD
G1	168.00%	10.40%	18.00%	0.215	-44.30%
G2	327.00%	15.60%	18.20%	0.500	-36.70%
G3	723.70%	23.50%	17.30%	0.981	-33.90%
G4	792.30%	24.50%	18.30%	0.980	-42.10%
G5	995.40%	27.00%	19.20%	1.068	-43.50%
L-S	281.10%	14.30%	13.80%	0.567	-16.80%

Feature importance (mean gain across folds)

SHAP contribution (mean absolute value)