Nifty 50 Alpha101 Selection & Composite Factor Research

Indian large caps · Price-volume alphas · Cross-sectional cleaning · Information coefficient screening · Linear factor combination · Machine-learning composite · Quintile portfolio backtests · Nifty 50 constituents

Summary

This study asks whether classic formulaic alphas — compact rules built from open, high, low, close, and volume — still separate winners from losers among Nifty 50 constituents over a long daily history of NSE-listed prices.

The workflow has three layers: (1) signal engineering on a daily panel of roughly fifty NSE stocks; (2) statistical evaluation via cross-sectional information coefficients and quintile spreads; (3) factor combination with both a transparent rolling linear model and an optional gradient-boosted machine-learning overlay.

Results are descriptive research outputs. They illustrate how multi-factor selection behaves on Indian large caps over the sample window; they are not live trading recommendations.

Research objectives

Measure predictive content of each alpha using daily Spearman IC and ICIR (mean IC divided by IC volatility).

Identify alphas stable enough to feed a composite score that ranks stocks each day.

Simulate long–short and quintile portfolios to see whether ranked signals translate into monotonic return patterns.

Compare a linear composite (rolling OLS with orthogonalized inputs) against a walk-forward LightGBM model trained on the same cleaned features.

Universe and market data

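The investable set is the Nifty 50 index: liquid names traded on the National Stock Exchange of India. The sample uses current index members with sufficient listing history rather than point-in-time membership, so stocks that left the index during the window are excluded and the results carry some survivorship bias.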

Daily adjusted open, high, low, close, and volume span from 2015 onward. Names with fewer than about five hundred trading days are dropped so rolling windows (20-day volatility, 60-day training) are well defined.

Market capitalisation at each rebalance supports size neutralisation. VWAP in volume-sensitive alphas is approximated as the typical price (average of high, low, and close) when exchange turnover detail is unavailable.

How the study is conducted

Step 1 — Panel construction. Prices are organised by trade date and stock. Forward one-day returns are the percentage change from today’s close to the next session’s close, aligned to today for prediction (no look-ahead in labels).
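A minimal sketch of the label alignment described in Step 1, assuming closes are held in a NumPy array of shape (dates × stocks). The function name `forward_returns` and the toy prices are illustrative and not taken from the study's code.

```python
import numpy as np

def forward_returns(close: np.ndarray) -> np.ndarray:
    """One-day forward return, stored against the day the forecast is made.

    close: adjusted closes, shape (n_dates, n_stocks).
    Row t holds close[t+1] / close[t] - 1. The last row is NaN because the
    next session's close is not yet known.
    """
    close = np.asarray(close, dtype=float)
    fwd = np.full_like(close, np.nan)
    fwd[:-1] = close[1:] / close[:-1] - 1.0
    return fwd

# Toy panel: three dates, two stocks (made-up prices).
close = np.array([[100.0, 50.0],
                  [102.0, 49.0],
                  [101.0, 49.98]])
fwd = forward_returns(close)
```

Any factor computed from data up to the close of row t can then be paired with row t of `fwd` without using information from later sessions.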

Step 2 — Alpha computation. Fifteen formulaic signals inspired by the WorldQuant Alpha101 library are evaluated in wide matrix form (dates × stocks), then stacked. Examples include short-horizon reversal, rank correlations of open and volume, VWAP distance, and close–open range ratios.
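The wide-matrix evaluation in Step 2 might look like the sketch below. `alpha101` follows the published Alpha#101 formula, (close − open) / ((high − low) + 0.001). The `five_day_reversal` definition is a hypothetical stand-in, since the study does not spell out its reversal formula, and `typical_price` is the VWAP proxy mentioned in the data section. The random prices exist only to make the example runnable.

```python
import numpy as np

def typical_price(high, low, close):
    """VWAP proxy: average of high, low, and close."""
    return (high + low + close) / 3.0

def alpha101(open_, high, low, close):
    """Alpha#101: intraday move scaled by the day's range."""
    return (close - open_) / ((high - low) + 0.001)

def five_day_reversal(close):
    """Hypothetical reversal signal: negative of the trailing 5-day return."""
    out = np.full_like(close, np.nan)
    out[5:] = -(close[5:] / close[:-5] - 1.0)
    return out

def stack(wide):
    """Flatten a (dates x stocks) matrix into rows of (date_idx, stock_idx, value)."""
    n_dates, n_stocks = wide.shape
    d, s = np.meshgrid(np.arange(n_dates), np.arange(n_stocks), indexing="ij")
    return np.column_stack([d.ravel(), s.ravel(), wide.ravel()])

# Synthetic OHLC panel: 8 dates x 3 stocks (illustrative only).
rng = np.random.default_rng(0)
close = 100.0 * np.cumprod(1.0 + 0.01 * rng.standard_normal((8, 3)), axis=0)
open_ = close * (1.0 + 0.002 * rng.standard_normal((8, 3)))
high = np.maximum(open_, close) * 1.01
low = np.minimum(open_, close) * 0.99

a101 = alpha101(open_, high, low, close)
rev5 = five_day_reversal(close)
vwap_proxy = typical_price(high, low, close)
long_form = stack(a101)
```

Working on whole matrices keeps every operation vectorised across stocks; stacking is deferred until the signals need to be joined with returns.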

Step 3 — Cross-sectional cleaning. For each date independently: replace infinities; MAD winsorise (median ± 3× scaled MAD); z-score; regress each factor on log market cap and take residuals; z-score again; fill remaining gaps with zero.
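The cleaning chain in Step 3 can be sketched for a single date as follows. The 1.4826 MAD scaling constant, the treatment of missing values, and the function names are assumptions; the study only states "median ± 3× scaled MAD" and the order of operations.

```python
import numpy as np

def zscore(x):
    """Standardise while ignoring NaNs; a constant input maps to zeros."""
    sd = np.nanstd(x)
    return (x - np.nanmean(x)) / sd if sd > 0 else np.zeros_like(x)

def clean_cross_section(factor, mcap, n_mad=3.0):
    """Clean one date's factor values across stocks (1-D arrays of equal length)."""
    x = np.asarray(factor, dtype=float).copy()
    x[~np.isfinite(x)] = np.nan                        # infinities become missing
    med = np.nanmedian(x)
    mad = 1.4826 * np.nanmedian(np.abs(x - med))       # scaled MAD (assumed constant)
    x = np.clip(x, med - n_mad * mad, med + n_mad * mad)
    x = zscore(x)
    size = np.log(np.asarray(mcap, dtype=float))
    ok = np.isfinite(x) & np.isfinite(size)
    if ok.sum() > 2:
        X = np.column_stack([np.ones(ok.sum()), size[ok]])
        beta, *_ = np.linalg.lstsq(X, x[ok], rcond=None)
        x[ok] = x[ok] - X @ beta                       # residual after removing size
    x[~ok] = np.nan
    x = zscore(x)
    return np.nan_to_num(x, nan=0.0)                   # remaining gaps filled with zero

# Synthetic cross-section of 50 stocks with a deliberate size tilt.
rng = np.random.default_rng(1)
mcap = np.exp(rng.normal(10.0, 1.0, size=50))
raw = 0.8 * np.log(mcap) + rng.standard_normal(50)
raw[0], raw[1], raw[2] = np.inf, np.nan, 50.0          # bad values and an outlier
cleaned = clean_cross_section(raw, mcap)
```

After cleaning, the non-missing values have zero mean, unit standard deviation, and zero sample correlation with log market cap, which is the point of the size-neutralisation step.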

Step 4 — Single-factor diagnostics. Every cleaned alpha is correlated with next-day returns across stocks (Spearman IC). Quintile portfolios sort stocks into five equal buckets; cumulative NAV paths and G5−G1 spreads are recorded.

Step 5 — Linear composite. Alphas passing IC screens (or, if none pass, the top three by |ICIR|) enter a 60-day rolling OLS. Inputs are Löwdin-orthogonalised each day; the regression target is the cross-sectional percentile rank of forward return so the model learns relative ordering, not market beta.
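One way to read Step 5 is sketched below: factors are symmetrically (Löwdin) orthogonalised per date, and a single OLS is fitted on the pooled observations of the previous 60 days. Pooling the window into one regression is an assumption, as are all function names; the study states only that the OLS is rolling over 60 days with rank targets.

```python
import numpy as np

def lowdin_orthogonalise(X, eps=1e-12):
    """Symmetric orthogonalisation of the columns of X (n_stocks x n_factors)."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc                                   # overlap matrix
    w, U = np.linalg.eigh(S)
    S_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T
    return Xc @ S_inv_sqrt                          # columns are orthonormal

def pct_rank(y):
    """Cross-sectional percentile rank in (0, 1]; ties are not averaged here."""
    return (np.argsort(np.argsort(y)) + 1) / len(y)

def rolling_composite(factors, fwd_ret, window=60):
    """factors: (n_dates, n_stocks, n_factors); fwd_ret: (n_dates, n_stocks).
    Returns scores of shape (n_dates, n_stocks), NaN until `window` days exist."""
    T, N, K = factors.shape
    Z = np.stack([lowdin_orthogonalise(factors[t]) for t in range(T)])
    Y = np.stack([pct_rank(fwd_ret[t]) for t in range(T)])
    score = np.full((T, N), np.nan)
    for t in range(window, T):
        # Train on days t-window .. t-1; their forward returns are realised by day t.
        Xw = Z[t - window:t].reshape(-1, K)
        yw = Y[t - window:t].reshape(-1)
        A = np.column_stack([np.ones(len(Xw)), Xw])
        beta, *_ = np.linalg.lstsq(A, yw, rcond=None)
        score[t] = beta[0] + Z[t] @ beta[1:]
    return score

# Synthetic panel with two correlated inputs (illustrative only).
rng = np.random.default_rng(2)
T, N, K = 80, 20, 3
factors = rng.standard_normal((T, N, K))
factors[..., 1] += 0.8 * factors[..., 0]
fwd_ret = 0.01 * factors[..., 0] + 0.02 * rng.standard_normal((T, N))
Z0 = lowdin_orthogonalise(factors[0])
score = rolling_composite(factors, fwd_ret, window=60)
```

Unlike sequential Gram-Schmidt, the symmetric scheme does not depend on the order of the inputs, so no alpha is privileged as the "first" factor.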

Step 6 — Machine-learning overlay (optional). The same cleaned features feed a walk-forward LightGBM regressor with embargo between validation and test windows. Out-of-sample scores become an ML alpha; feature importance and mean |SHAP| values summarise drivers.

How signals are evaluated

Information Coefficient (IC). Each trading day, factor values and realised next-day returns are ranked across stocks; IC is the Spearman correlation of those ranks. A positive IC means higher factor scores tended to precede higher returns.

ICIR. Mean daily IC divided by its standard deviation. Higher |ICIR| suggests a more stable signal; very low ICIR often means noise dominates.
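The two definitions above can be computed as in the following sketch, which uses NumPy only and averages tied ranks by hand. The helper names and the use of the sample standard deviation (ddof=1) are assumptions, and the toy numbers are made up.

```python
import numpy as np

def rank_avg(x):
    """Ranks 1..n, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inv, weights=ranks)
    return sums[inv] / counts[inv]

def daily_ic(factor, fwd_ret):
    """Spearman IC per date for (n_dates, n_stocks) arrays; NaNs are skipped."""
    ics = np.full(factor.shape[0], np.nan)
    for t in range(factor.shape[0]):
        ok = np.isfinite(factor[t]) & np.isfinite(fwd_ret[t])
        if ok.sum() < 3:
            continue
        rf, rr = rank_avg(factor[t, ok]), rank_avg(fwd_ret[t, ok])
        if rf.std() == 0 or rr.std() == 0:
            continue                                # undefined for constant input
        ics[t] = np.corrcoef(rf, rr)[0, 1]          # Pearson on ranks = Spearman
    return ics

def ic_summary(ics):
    """Return (mean IC, IC standard deviation, ICIR)."""
    ics = ics[np.isfinite(ics)]
    mean, std = ics.mean(), ics.std(ddof=1)
    return mean, std, mean / std

# Three dates, four stocks: perfect, partial, and inverted ordering.
factor = np.array([[1.0, 2.0, 3.0, 4.0],
                   [1.0, 2.0, 3.0, 4.0],
                   [4.0, 3.0, 2.0, 1.0]])
fwd_ret = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.1, 0.3, 0.2, 0.4],
                    [0.1, 0.2, 0.3, 0.4]])
ics = daily_ic(factor, fwd_ret)          # 1.0, 0.8, -1.0
ic_mean, ic_std, icir = ic_summary(ics)
```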

Selection rule. Strict inclusion requires |mean IC| > 2% and |ICIR| > 0.30. On Nifty 50 these thresholds are demanding; when no factor qualifies, the composite falls back to the three alphas with the largest |ICIR| among those with |mean IC| ≥ 1%.
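The rule can be written as a short function. The sketch below applies it to a handful of (IC mean, ICIR) pairs taken from the factor screening results reported on this page; the function name and argument names are illustrative.

```python
def select_alphas(stats, ic_min=0.02, icir_min=0.30, fallback_ic_min=0.01, fallback_n=3):
    """stats maps alpha name -> (ic_mean, icir). Returns (selected names, rule used)."""
    strict = [name for name, (ic, icir) in stats.items()
              if abs(ic) > ic_min and abs(icir) > icir_min]
    if strict:
        return strict, "strict"
    # Fallback: largest |ICIR| among alphas with |mean IC| >= fallback_ic_min.
    eligible = [name for name, (ic, _) in stats.items() if abs(ic) >= fallback_ic_min]
    eligible.sort(key=lambda name: abs(stats[name][1]), reverse=True)
    return eligible[:fallback_n], "fallback"

stats = {
    "alpha038": (0.0340, 0.178),
    "alpha101": (-0.0280, -0.157),
    "alpha_5_day_reversal": (-0.0270, -0.134),
    "alpha041": (0.0220, 0.127),
    "alpha001": (-0.0210, -0.121),
    "alpha098": (-0.0010, -0.008),
}
selected, rule = select_alphas(stats)
```

With these inputs no alpha clears |ICIR| > 0.30, so the fallback branch returns alpha038, alpha101, and alpha_5_day_reversal, matching the composite inputs listed in the study snapshot.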

Layered backtest. Stocks are sorted into quintiles G1 (lowest factor) through G5 (highest). Each bucket is equal-weighted; the long–short leg is G5 minus G1. Reported metrics: cumulative return, annualised return and volatility, Sharpe ratio, and maximum drawdown — before brokerage, STT, or borrow costs.
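A compact sketch of the layered backtest follows. The annualisation convention (252 periods, geometric return), the risk-free rate parameter `rf`, and the choice to compound the long–short spread like an ordinary return series are assumptions; the study does not state which conventions it uses.

```python
import numpy as np

def quintile_returns(factor, fwd_ret, n_buckets=5):
    """Equal-weighted bucket returns per date, shape (n_dates, n_buckets + 1).
    Columns 0..n_buckets-1 are G1..G5; the last column is G5 minus G1."""
    T = factor.shape[0]
    out = np.full((T, n_buckets + 1), np.nan)
    for t in range(T):
        ok = np.isfinite(factor[t]) & np.isfinite(fwd_ret[t])
        if ok.sum() < n_buckets:
            continue
        order = np.argsort(factor[t, ok], kind="mergesort")
        groups = np.array_split(order, n_buckets)     # first group = lowest factor
        r = fwd_ret[t, ok]
        out[t, :n_buckets] = [r[g].mean() for g in groups]
        out[t, -1] = out[t, n_buckets - 1] - out[t, 0]
    return out

def performance(daily, periods=252, rf=0.0):
    """Summary metrics for one daily return series (costs ignored)."""
    daily = daily[np.isfinite(daily)]
    nav = np.concatenate([[1.0], np.cumprod(1.0 + daily)])   # start NAV at 1
    ann_ret = nav[-1] ** (periods / len(daily)) - 1.0
    ann_vol = daily.std(ddof=1) * np.sqrt(periods)
    return {
        "cum": nav[-1] - 1.0,
        "ann_ret": ann_ret,
        "ann_vol": ann_vol,
        "sharpe": (ann_ret - rf) / ann_vol,
        "max_dd": (nav / np.maximum.accumulate(nav) - 1.0).min(),
    }

# One date, ten stocks whose returns line up perfectly with the factor.
factor = np.arange(10, dtype=float)[None, :]
fwd_ret = factor / 100.0
buckets = quintile_returns(factor, fwd_ret)

# Three made-up daily returns to exercise the metrics.
stats = performance(np.array([0.10, -0.50, 0.20]))
```

With roughly fifty names, each bucket holds about ten stocks, so a single large move in one name can shift a bucket's daily return noticeably.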

Machine-learning extension

The gradient-boosted model uses L1 regression (robust to return outliers), shallow trees, column subsampling, and early stopping on a validation slice inside each walk-forward fold.
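For orientation, a LightGBM configuration in this spirit might look like the sketch below. Every numeric value is an illustrative guess rather than the study's setting, and `fit_fold` is a hypothetical helper. The parameter names (`regression_l1`, `max_depth`, `num_leaves`, `feature_fraction`) and the `early_stopping` callback are standard LightGBM options.

```python
# Illustrative hyperparameters; the study does not publish its exact values.
PARAMS = {
    "objective": "regression_l1",   # absolute-error loss, less sensitive to outliers
    "max_depth": 3,                 # shallow trees
    "num_leaves": 8,                # at most 2 ** max_depth leaves
    "learning_rate": 0.05,
    "feature_fraction": 0.7,        # column subsampling per tree
    "verbosity": -1,
}

def fit_fold(X_train, y_train, X_valid, y_valid, params=PARAMS,
             max_rounds=500, patience=50):
    """Fit one walk-forward fold with early stopping on the validation slice."""
    import lightgbm as lgb  # imported lazily so the config can be read without the library
    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        params,
        train_set,
        num_boost_round=max_rounds,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=patience, verbose=False)],
    )
```

The returned booster exposes `best_iteration`, which can be passed to `predict` so that test-window scores use the round chosen on the validation slice.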

Training windows span roughly twelve months; validation three months; test three months; one-day embargo between validation end and test start to limit label leakage from overlapping forward returns.
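The fold layout can be sketched with plain index ranges. The day counts (252, 63, 63), the rolling rather than expanding training window, and the step of one test window between folds are assumptions chosen to match "roughly twelve months" and "three months"; the function name is illustrative.

```python
def walk_forward_folds(n_days, train=252, valid=63, test=63, embargo=1):
    """Return a list of (train, valid, test) ranges over day indices 0..n_days-1.

    `embargo` days are skipped between the end of validation and the start of
    the test window, so the last validation label (which depends on the next
    session's close) does not overlap the first test day.
    """
    folds = []
    start = 0
    while True:
        train_end = start + train
        valid_end = train_end + valid
        test_start = valid_end + embargo
        test_end = test_start + test
        if test_end > n_days:
            break
        folds.append((range(start, train_end),
                      range(train_end, valid_end),
                      range(test_start, test_end)))
        start += test            # roll forward by one test window
    return folds

folds = walk_forward_folds(1000)
first_train, first_valid, first_test = folds[0]
```

Because each fold advances by exactly one test window, the test ranges tile the out-of-sample period without overlap, and concatenating their predictions gives one continuous out-of-sample score series.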

The ML alpha is evaluated with the same IC and quintile machinery as single-factor alphas. Feature-importance bars show average gain across folds; SHAP summaries compress the beeswarm view into mean absolute contribution per feature on the last test sample.

Limitations and interpretation

Fifty stocks is a narrow cross-section compared with broader universes such as the CSI 300 or the full NSE listing; daily IC estimates are noisier and each quintile bucket holds only about ten names.

Published closing prices may differ slightly from exchange official figures; corporate actions are handled via standard price adjustment.

Industry neutralisation is not applied — only log market cap — because sector classifications are not part of this study.

Backtests are paper portfolios without Indian transaction taxes, stamp duty, or short borrow; live implementation would reduce net spreads.

In-sample factor mining across fifteen correlated alphas increases multiple-testing risk; treat significance as exploratory.

Interactive results (below)

The screening table lists IC mean, IC standard deviation, ICIR, and composite membership for every alpha.

Use the factor tabs to inspect daily IC (bars) and cumulative IC (line), quintile NAV curves, and the full performance table for any signal.

The rolling linear composite section shows the OLS-based synthetic factor; the ML alpha section appears only when the gradient-boosted model ran successfully in the latest build.

Empirical results

Interactive diagnostics from the latest study. Time series are downsampled for display; performance tables use the full sample.

Study snapshot

Universe: Nifty 50 (49 stocks with sufficient history)
Sample period: 2015-01-01 to 2026-05-25
Alphas computed: 15
OLS composite inputs: alpha038, alpha101, alpha_5_day_reversal (fallback: top three by |ICIR|)
Market: India, NSE-listed equities
ML walk-forward: 39 out-of-sample folds

Factor screening

Cross-sectional Spearman IC diagnostics for each cleaned alpha. Composite membership follows strict IC/ICIR thresholds or a ranked fallback when no factor clears the bar.

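| Factor | IC mean | IC std | ICIR | In composite |
| --- | --- | --- | --- | --- |
| alpha038 | 0.0340 | 0.1890 | 0.178 | yes |
| alpha101 | -0.0280 | 0.1800 | -0.157 | yes |
| alpha_5_day_reversal | -0.0270 | 0.2000 | -0.134 | yes |
| alpha041 | 0.0220 | 0.1700 | 0.127 | no |
| alpha001 | -0.0210 | 0.1690 | -0.121 | no |
| alpha006 | 0.0160 | 0.1590 | 0.100 | no |
| alpha042 | 0.0190 | 0.1940 | 0.095 | no |
| alpha040 | 0.0140 | 0.1670 | 0.082 | no |
| alpha094 | 0.0140 | 0.1790 | 0.080 | no |
| alpha002 | 0.0100 | 0.1550 | 0.064 | no |
| alpha012 | 0.0090 | 0.1550 | 0.059 | no |
| alpha072 | 0.0080 | 0.1530 | 0.052 | no |
| alpha088 | 0.0060 | 0.1500 | 0.039 | no |
| alpha003 | 0.0030 | 0.1490 | 0.019 | no |
| alpha098 | -0.0010 | 0.1500 | -0.008 | no |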

Single-factor diagnostics

Select an alpha to view its information coefficient path and quintile backtest. G1 is the lowest factor bucket; G5 the highest; L-S is the top-minus-bottom spread.

[Figure: information coefficient for the selected alpha]

[Figure: quintile cumulative NAV for the selected alpha]

Rolling linear composite

Multi-alpha score from 60-day rolling OLS on orthogonalised inputs. Training labels are cross-sectional ranks of next-day return.

[Figure: information coefficient, composite]

[Figure: quintile cumulative NAV, composite]

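| Bucket | Cum return | Ann return | Ann vol | Sharpe | Max DD |
| --- | --- | --- | --- | --- | --- |
| G1 | 124.50% | 7.70% | 19.00% | 0.063 | -49.70% |
| G2 | 252.00% | 12.20% | 17.90% | 0.320 | -47.60% |
| G3 | 1111.50% | 25.70% | 17.90% | 1.073 | -37.40% |
| G4 | 1285.30% | 27.20% | 17.90% | 1.156 | -35.80% |
| G5 | 889.10% | 23.40% | 18.90% | 0.893 | -36.50% |
| L-S | 289.90% | 13.30% | 14.70% | 0.461 | -32.00% |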

Machine-learning alpha

LightGBM walk-forward model trained on the same cleaned alphas. Out-of-sample predictions are ranked and backtested like single-factor signals.

IC mean 0.0240 · ICIR 0.143 · 39 folds

[Figure: information coefficient, ML alpha]

[Figure: quintile cumulative NAV, ML alpha]

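| Bucket | Cum return | Ann return | Ann vol | Sharpe | Max DD |
| --- | --- | --- | --- | --- | --- |
| G1 | 221.40% | 12.70% | 18.00% | 0.346 | -42.10% |
| G2 | 291.60% | 15.00% | 18.20% | 0.469 | -38.90% |
| G3 | 573.60% | 21.60% | 17.70% | 0.854 | -37.20% |
| G4 | 930.40% | 27.00% | 18.10% | 1.137 | -38.00% |
| G5 | 874.30% | 26.30% | 18.90% | 1.046 | -44.00% |
| L-S | 182.10% | 11.20% | 13.70% | 0.344 | -22.20% |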

[Figure: feature importance (mean gain across folds)]

[Figure: SHAP contribution (mean absolute value)]

QuantifiedTrader is operated by an independent research-only group focused on building, documenting, and improving open quantitative-finance tools. Our purpose is to study markets, models, and methods—not to sell products, manage assets, or act on behalf of third parties.

© 2026 QuantifiedTrader