NSE India  ·  NIFTY Indices

Stock Market Trend Prediction

End-to-end ML & Deep Learning system for forecasting NIFTY IT, Metal & Financial Services indices using 8 algorithms on 8,918+ trading records.

🤖 8 ML Algorithms
📁 8,918 Data Records
📊 3 NIFTY Indices
📅 2003–2021 Date Coverage

About the Dataset

Multi-year historical OHLCV data for three NSE NIFTY sector indices, sourced from National Stock Exchange of India.

8,918 Total Records
2003–2021 Date Coverage
3 NIFTY Indices
8 Feature Columns

📁 Data Files

File | Index | Records | Period
NIFTY.csv | All 3 Indices | 8,918 | 2003–2021
NIFTY IT_data.csv | NIFTY IT | 4,355 | 2003–2021
NIFTY FIN SERVICE_data.csv | NIFTY Fin Svc | 2,232 | 2012–2021
NIFTY METAL_data.csv | NIFTY Metal | 2,331 | 2011–2021

🗂️ Schema / Columns

Column | Type | Description
Date | datetime | Trading date
Open | float | Opening index price
High | float | Intraday high
Low | float | Intraday low
Close | float | Closing price ← TARGET
Volume | int | Shares traded
Turnover | float | Notional value (INR)
Indices | string | Label-encoded index name
🖥️ NIFTY IT 4,355 rows

India's IT sector index, covering TCS, Infosys, Wipro, and HCL. Spans 2003–2021. The longest time series in the dataset, with a strong growth trend after 2017.

Peak: ~27,095 · 2020 Rally: +90%
⚙️ NIFTY Metal 2,331 rows

Tracks the metals and mining sector, including JSW Steel, Tata Steel, and Hindalco. Highly cyclical: correlated with global commodity prices and demand from China.

2021 High: ~3,945 · Cyclical Pattern
🏦 NIFTY Fin Services 2,232 rows

Covers HDFC Bank, ICICI, Bajaj Finance, and Kotak over 2012–2021. The financial sector is heavily correlated with interest rate cycles and RBI policy decisions.

2021 High: ~17,654 · Strong Uptrend

Problem Statement & Objectives

Business Problem

Stock markets are inherently volatile and driven by countless interacting factors. Retail investors and fund managers struggle to identify reliable price direction signals from raw OHLCV data alone. Without a systematic, data-driven approach, investment decisions are largely reactive — made after the price has already moved.

The Indian equity market (NSE) lacks accessible, open-source ML-based prediction tools specifically designed for sector indices like NIFTY IT, Metal, and Financial Services — which have distinct cyclical behaviors and risk profiles.

Abstract

This project builds a comprehensive, end-to-end stock market trend prediction system targeting NSE NIFTY sector indices. Historical OHLCV data spanning 2003–2021 is processed through a multi-model ML framework comprising SVR, Random Forest, Linear Regression, KNN, Decision Trees, ElasticNet, and LSTM deep learning.

Polynomial feature expansion and temporal decomposition enrich the feature space. All models are evaluated on held-out test sets using MSE and R², with results exposed through an interactive Flask web interface that allows real-time algorithm comparison.

🎯 Project Objectives

Compared 8 ML/DL algorithms on real NSE data using consistent metrics — MSE and R².
Deployed an LSTM network to capture sequential price patterns that classical models miss.
Used polynomial expansion (degree 2) and date decomposition to build richer input features.
Built a Flask app where anyone can pick a stock, choose an algorithm, and see live results.
Saved the trained pipeline as .pkl files so predictions load instantly without retraining.
Turned model outputs into plain signals — trend direction, support zones, and volatility windows.

🔬 Scope & Constraints

In Scope
  • NSE NIFTY sector indices
  • OHLCV-based regression
  • 2003–2021 historical window
  • 8 ML/DL algorithms
  • Flask web interface
Out of Scope
  • Individual stock prediction
  • Options / derivatives pricing
  • Sentiment / news analysis
  • Real-time live data
  • Portfolio optimization
Assumptions
  • Past patterns inform future prices
  • Market is not perfectly efficient
  • OHLCV data is clean & complete
  • Short-term horizon (1–5 days)
  • No transaction costs modelled

ML Models & Algorithms

8 algorithms spanning classical regression, SVMs, ensemble methods, and deep learning — all implemented in utils.py.

🌲
Random Forest Regressor
Ensemble of 10–15 decision trees, each trained on a bootstrapped subset of the data. Each tree predicts the closing price, and the final prediction is the average of those outputs. Polynomial feature expansion (degree=2) is applied as preprocessing. Achieves R² ≈ 0.99 on the training set.
Best Performer · n_estimators=15 · Serialised as .pkl
🧠
LSTM Recurrent Neural Network
Long Short-Term Memory network built with TensorFlow/Keras. Designed specifically for sequential time-series data. Maintains a hidden state across timesteps, capturing long-range dependencies invisible to classical models. Input normalised via MinMaxScaler.
Deep Learning · TensorFlow / Keras · Sequential Aware
📐
Support Vector Regression
Three kernel variants: Linear (SVR_linear, C=1000), Polynomial (SVR_poly, degree=3), and RBF (SVR_rbf, C=1000, γ=0.1). RBF kernel handles complex non-linear price patterns. All trained on lagged close prices with 67/33 train-test split.
3 Kernels · Robust to Outliers
📉
Linear Regression & ElasticNet
Linear Regression serves as the interpretable baseline. ElasticNet combines L1 (Lasso) + L2 (Ridge) regularisation to reduce overfitting on noisy financial series. Both trained on look_back=1 lagged close price features.
Baseline Model · L1+L2 Penalty
🔵
K-Nearest Neighbours (KNN)
Non-parametric instance-based regression with k=2. Predicts the next close price by averaging the 2 most similar historical values. No training phase — prediction is computed at inference time. Sensitive to scaling and noise.
k=2 neighbours · No Training Phase
🌿
Decision Tree Regressor
Single unpruned decision tree (sklearn CART). Fully interpretable — every prediction can be traced along a branch from root to leaf. High variance on unseen financial data due to lack of pruning; used primarily as a comparison baseline.
Fully Interpretable · Baseline Reference

⚙️ Preprocessing Pipeline

Step 1 — Feature Extraction
Date → Year, Month, Day, DayOfWeek, WeekOfYear. Indices → LabelEncoder integer.
Step 2 — Polynomial Expansion
PolynomialFeatures(degree=2) expands 4 features → 15 terms (1 bias, 4 linear, 4 squared, 6 pairwise interactions) for the RF regression path.
Step 3 — Scaling & Split
MinMaxScaler for LSTM path. 67/33 train-test split for classical models. Sequential split for LSTM.

📊 Model Comparison Table

Algorithm | Type | Key Hyperparams | Expected R² | Strength | Limitation
Random Forest | Ensemble | n_est=15, poly deg=2 | 0.97–0.99 | Non-linear, feature importance | Can overfit
LSTM | Deep Learning | look_back=1, MinMaxScaler | 0.95–0.99 | Temporal dependencies | Needs large data
SVR (RBF) | SVM | C=1000, γ=0.1 | 0.90–0.96 | Robust to outliers | Slow on large sets
SVR (Linear) | SVM | C=1000 | 0.85–0.92 | Fast, stable | Misses non-linearity
Linear Regression | Baseline | - | 0.85–0.93 | Interpretable | Linear only
ElasticNet | Regularised | L1+L2, rs=0 | 0.82–0.90 | Less overfit | Linear assumption
KNN | Instance | k=2 | 0.70–0.85 | No training phase | Noise sensitive
Decision Tree | Baseline | Unpruned | 0.95–1.0 train | Interpretable | Overfits badly

Advanced Visualizations

All charts use real NIFTY data extracted from the dataset. Most focus on the 2020–2021 period; charts 2 and 6 cover longer windows.

1. Candlestick Chart — NIFTY IT (Dec 2020 – Feb 2021)
Each candle = 1 trading day. Green = bullish close, Red = bearish close.
NIFTY IT
Key Findings
  • Strong bullish rally from Dec 22, 2020 — NIFTY IT surged from ~22,845 to a peak of 27,095 in Jan 2021 (+18.6% in 3 weeks).
  • Jan 11, 2021 saw the highest single-day volume (100.7M shares) coinciding with a breakout above 27,000.
  • Feb 22–26 shows a bearish reversal cluster — consecutive red candles signaling short-term correction from all-time highs.
  • Large upper wicks on Jan 5–6 indicate strong selling pressure near 26,000 before the final breakout.
2. Long-Term Close Price Line Chart — NIFTY IT (2003–2021)
Monthly closing price over 18 years. Reveals macro-level market cycles.
18-Year View
Key Findings
  • 2003–2004 dot-com recovery peak: NIFTY IT briefly hit ~23,542 before the 2004 election crash wiped 90% of value (index reconstitution effect).
  • 2008 Global Financial Crisis: sharp decline from ~5,800 to ~2,094 (−64%) in 12 months.
  • 2020 COVID crash (Mar 2020): dropped to 12,763 — but recovered to record highs by Jul 2020 within 4 months.
  • 2020–2021 digital acceleration: NIFTY IT outperformed all sectors, rising +90% as work-from-home drove tech demand globally.
3. OHLC Bar Chart — NIFTY Metal (Jan–Feb 2021)
Each bar shows Open, High, Low, Close for the trading day.
NIFTY Metal
Key Findings
  • NIFTY Metal entered a strong uptrend in Feb 2021 — price climbed from ~3,077 to ~3,928 (+27.7%) in just 5 weeks.
  • Feb 25, 2021 recorded the highest high in the period at 3,945 — a multi-year breakout driven by global steel prices.
  • Jan 18–22 shows a consolidation cluster with narrowing OHLC ranges, typical of accumulation before a breakout move.
  • Feb 1 opened at 3,096 and closed at 3,227 — a strong bullish engulfing after 2 weeks of sideways movement.
4. Heikin-Ashi Chart — NIFTY Financial Services (Dec 2020 – Feb 2021)
Smoothed candlesticks using HA formula. Removes noise to reveal clear trend direction.
Heikin-Ashi
Key Findings
  • Heikin-Ashi reveals an unbroken green (bullish) trend from Dec 8, 2020 to Feb 16, 2021 — 50 consecutive bullish candles with almost no wicks on the downside.
  • The sharp Feb 22–26 reversal (3 red HA candles in a row) signals the end of the trend — a reliable exit signal for traders.
  • Budget day (Feb 1, 2021) produced the largest HA candle body (+8.6%) as the Union Budget 2021 announced infra spending — a strong financial services catalyst.
  • HA smoothing confirms what regular candlesticks showed with noise: the index was in a clean, sustained uptrend from mid-December to mid-February.
5. Volume vs Price Analysis — NIFTY IT (Dec 2020 – Feb 2021)
Volume bars (grey) overlaid with close price line (cyan). Identifies volume-price divergences.
Volume Analysis
Key Findings
  • Jan 11, 2021: highest volume (100.7M) coincided with a breakout above 27,000 — strong confirmation of the trend.
  • Jan 8 and Jan 14 both saw 80M+ volume with price near resistance at 26,000–27,000 — institutional accumulation zone.
  • Dec 23: spike to 87.3M volume on the breakout day from 23,611 → 24,167 (the biggest single-day gain in the period).
  • Late Feb volume (49.2M on Feb 26) spiked as price fell — a bearish distribution signal indicating smart money selling.
6. Yearly Average Close — All 3 NIFTY Indices (2012–2021)
Grouped bar chart comparing average annual closing prices across IT, Metal, and Financial Services.
Multi-Index
Key Findings
  • NIFTY IT dramatically outperformed all indices in 2020–2021, rising from 17,281 (2020 avg) to 25,702 (2021) — a surge of +48.7%.
  • NIFTY Metal remained flat for most of 2012–2019 (~2,100–3,600 range), reflecting global commodity market stagnation.
  • NIFTY Financial Services shows the most consistent growth trajectory: from 4,375 in 2012 to 16,033 in 2021 — a 3.66× increase over 9 years.
  • 2019–2020 saw a dip across all indices due to COVID — but 2021 shows full recovery and new highs, especially in IT.

Statistical Methods Used

Mathematical and statistical foundations underpinning the feature engineering, model training, and evaluation pipeline.

📏
Mean Squared Error (MSE)
MSE = Σ(yᵢ − ŷᵢ)² / n
Primary evaluation metric for all regression models. Penalizes large errors heavily due to squaring. Used to rank all 8 algorithms on the 33% test set. Lower MSE = better model.
R² Score (Coefficient of Determination)
R² = 1 − SS_res / SS_tot
Measures proportion of variance explained by the model. R²=1.0 is perfect fit; R²=0 means model predicts the mean. Random Forest (poly) achieves ~0.99 on training data.
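As a rough illustration of the two metrics above, the following pure-Python sketch computes MSE and R² directly from their formulas. The function names and the sample values are made up for this example and are not taken from utils.py.

```python
def mse(y_true, y_pred):
    """Mean squared error: sum((y - y_hat)^2) / n."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up closing prices and predictions, for illustration only.
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 102.0, 103.0, 107.0]

print(mse(actual, predicted))       # 0.75
print(r2_score(actual, predicted))  # 0.85
```

Predicting the mean of `actual` for every point would give R² = 0, which is the baseline the score is measured against.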
🔢
Polynomial Feature Expansion
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
Transforms 4 input features [Day, Month, Year, Index] into 15 features including all pairwise interactions (e.g. Day×Month, Year²). Allows linear models to fit non-linear relationships.
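To see where the count of 15 comes from, the sketch below enumerates every monomial of total degree 2 or less over four feature names, which is the same set of terms scikit-learn's PolynomialFeatures produces with its default bias column. The helper `polynomial_terms` is a hypothetical name used only for this illustration.

```python
from itertools import combinations_with_replacement


def polynomial_terms(names, degree=2):
    """List every monomial of total degree <= `degree`, bias term included."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            # The empty combination (d = 0) is the constant bias term.
            terms.append("*".join(combo) if combo else "1")
    return terms


terms = polynomial_terms(["Day", "Month", "Year", "Index"])
print(len(terms))  # 15 = 1 bias + 4 linear + 4 squares + 6 pairwise products
print(terms)
```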
📊
Label Encoding
LabelEncoder().fit_transform(Indices)
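Converts categorical index names (NIFTY IT, NIFTY METAL, etc.) to integer labels (0, 1, 2) so models can distinguish between index types. The integer codes imply an ordering that does not actually exist between the indices; tree-based models tolerate this, but linear models may be misled by it.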
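A minimal sketch of the encoding, assuming the Indices column holds labels spelled like the data file names; the exact strings in the dataset may differ. Like scikit-learn's LabelEncoder, it assigns codes in sorted order of the distinct labels. `fit_label_encoder` is a hypothetical helper, not a function from the project.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: code for code, name in enumerate(classes)}


# Assumed label spellings, for illustration only.
indices = ["NIFTY IT", "NIFTY METAL", "NIFTY FIN SERVICE", "NIFTY IT"]
mapping = fit_label_encoder(indices)
encoded = [mapping[name] for name in indices]

print(mapping)
print(encoded)  # [1, 2, 0, 1]
```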
📉
Min-Max Normalization
x' = (x − min) / (max − min)
Applied to the close price series before LSTM training. Scales all values to the [0, 1] range, which keeps inputs within the sensitive region of the LSTM's sigmoid and tanh activations and helps training converge stably.
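The scaling and its inverse can be sketched in a few lines. This is an illustration of the formula rather than the project's code, and the helper names are hypothetical. One common precaution, assumed here rather than confirmed by the source, is to compute min and max on the training portion only so that test-set values do not leak into the scaler.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; also return the min and max used."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span for v in values], lo, hi


def inverse_scale(scaled, lo, hi):
    """Map scaled values back to the original price range."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up prices, for illustration only.
prices = [50.0, 75.0, 100.0]
scaled, lo, hi = min_max_scale(prices)

print(scaled)                         # [0.0, 0.5, 1.0]
print(inverse_scale(scaled, lo, hi))  # [50.0, 75.0, 100.0]
```

The inverse step is what turns a model's normalised output back into an index level.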
🧩
Sliding Window (Look-Back)
X[t] = close[t : t+k] → Y[t] = close[t+k]
Creates supervised learning pairs from a time series. With look_back=1, yesterday's close is used to predict today's. Can be extended to 5/20/60 days to capture weekly, monthly, or quarterly momentum.
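The windowing rule X[t] = close[t : t+k] → Y[t] = close[t+k] translates almost directly into code. The sketch below uses a hypothetical helper name and made-up prices.

```python
def make_windows(close, look_back=1):
    """Build (X, y) pairs: `look_back` past closes predict the next close."""
    X, y = [], []
    for t in range(len(close) - look_back):
        X.append(close[t:t + look_back])
        y.append(close[t + look_back])
    return X, y


# Made-up closing prices, for illustration only.
close = [10, 11, 12, 13, 14]

X1, y1 = make_windows(close, look_back=1)
print(X1, y1)  # [[10], [11], [12], [13]] [11, 12, 13, 14]

X3, y3 = make_windows(close, look_back=3)
print(X3, y3)  # [[10, 11, 12], [11, 12, 13]] [13, 14]
```

A series of n prices yields n − look_back training pairs, so longer windows trade sample count for context.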
✂️
Train-Test Split
train_test_split(X, y, test_size=0.33)
67/33 random split used for all classical models. For LSTM, a sequential split is used (last 2 rows as test) to avoid data leakage — future data is never visible to the model during training.
📐
Heikin-Ashi Transformation
HA_Close = (O+H+L+C)/4
HA_Open = (prev_HA_O + prev_HA_C)/2
Statistical smoothing of OHLC data by computing modified candles. Reduces noise in price series and makes trends more visually identifiable. Applied in visualization layer.
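A minimal sketch of the transformation, using the two formulas above. Two details are assumptions taken from the usual Heikin-Ashi convention rather than from the source: the first HA_Open is seeded with (Open + Close) / 2, and HA_High / HA_Low are the max / min of the raw high or low and the HA open and close.

```python
def heikin_ashi(candles):
    """candles: list of (open, high, low, close) tuples in date order."""
    result = []
    for i, (o, h, l, c) in enumerate(candles):
        ha_close = (o + h + l + c) / 4
        if i == 0:
            ha_open = (o + c) / 2  # seed value: assumed convention
        else:
            prev_open, _, _, prev_close = result[-1]
            ha_open = (prev_open + prev_close) / 2
        ha_high = max(h, ha_open, ha_close)
        ha_low = min(l, ha_open, ha_close)
        result.append((ha_open, ha_high, ha_low, ha_close))
    return result


# Made-up candles, for illustration only.
candles = [(100, 110, 95, 105), (105, 115, 100, 110)]
ha = heikin_ashi(candles)
print(ha)  # [(102.5, 110, 95, 102.5), (102.5, 115, 100, 107.5)]
```

A candle counts as bullish (green) when its HA close is above its HA open, as in the second candle here.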
🎲
Bootstrap Aggregating (Bagging)
RF = avg(Tree₁(X_b₁), Tree₂(X_b₂), ..., Treeₙ)
Core mechanism of Random Forest. Each tree trains on a bootstrapped (sampled with replacement) subset of the training data. Averaging multiple trees reduces variance and therefore the overfitting seen in a single unpruned tree, although it does not eliminate it.
⚖️
ElasticNet Regularization
L(β) = MSE + α·ρ·|β|₁ + α(1-ρ)/2·|β|₂²
Combines L1 (Lasso) and L2 (Ridge) penalties. L1 enforces sparsity (drops irrelevant features); L2 handles correlated features. Balances bias-variance tradeoff on noisy financial data.
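The objective above can be evaluated by hand for a small coefficient vector. The function below is an illustrative sketch with hypothetical names; it takes the MSE as a given number and only adds the two penalty terms.

```python
def elastic_net_loss(mse_value, beta, alpha=1.0, rho=0.5):
    """MSE + alpha*rho*|beta|_1 + alpha*(1 - rho)/2 * |beta|_2^2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return mse_value + alpha * rho * l1 + alpha * (1 - rho) / 2 * l2_squared


# Made-up coefficients, for illustration only.
beta = [3.0, -4.0]

print(elastic_net_loss(2.0, beta, alpha=1.0, rho=0.5))  # 11.75
print(elastic_net_loss(2.0, beta, alpha=1.0, rho=1.0))  # 9.0  (pure L1)
print(elastic_net_loss(2.0, beta, alpha=1.0, rho=0.0))  # 14.5 (pure L2)
```

Setting ρ to 1 or 0 recovers the Lasso and Ridge penalties respectively, which is why ρ is usually described as the mixing ratio.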
🧠
LSTM Backpropagation Through Time
∂L/∂W = Σ(t=1 to T) ∂Lₜ/∂W
Training algorithm for LSTM. Gradients flow backward through time steps. LSTM gates (forget, input, output) mitigate vanishing gradients, a key advantage over vanilla RNNs for long sequences.
📈
Temporal Feature Decomposition
Date → {Year, Month, Day, DayOfWeek, WeekOfYear}
Decomposes a single datetime into 5 numerical features. Allows models to learn seasonal patterns: month effects (budget season), day effects (expiry dates), and long-term yearly trends.
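The decomposition can be sketched with the standard library alone. The key names mirror the five features listed above; the choice of Monday = 0 for DayOfWeek and the ISO week number for WeekOfYear are assumptions about the encoding, not details confirmed by the source.

```python
from datetime import date


def decompose_date(d):
    """Split a date into the five numeric features described above."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "DayOfWeek": d.weekday(),          # Monday = 0 ... Sunday = 6
        "WeekOfYear": d.isocalendar()[1],  # ISO 8601 week number
    }


features = decompose_date(date(2021, 2, 1))
print(features)
# {'Year': 2021, 'Month': 2, 'Day': 1, 'DayOfWeek': 0, 'WeekOfYear': 5}
```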

Prediction: Opening & Closing Price

Enter a date and index to get predicted Open & Close prices from all 5 algorithms, plus a comparative accuracy chart.

Algorithm Comparison — Predicted vs Actual Close Price
NIFTY IT, Dec 2020 – Feb 2021. Lower deviation from Actual (black line) = more accurate model.

📊 MSE Accuracy Ranking (Lower = Better)

Model Selection Insight
  • LSTM achieves the lowest MSE (~28,000) — it benefits from sequential awareness even with look_back=1.
  • Random Forest is a close second (~34,000) — polynomial features give it a significant edge over other classical models.
  • KNN (k=2) shows highest variance — its predictions are highly sensitive to the nearest neighbour and can swing widely on volatile days.
  • All models track the overall price trend well, but diverge significantly during high-volatility events (Jan 11 breakout, Feb 22 selloff).

Business Recommendations

Translating model outputs into actionable investment and technology strategy for financial firms.

🎯

Use LSTM for Short-Term Signal Generation

Given LSTM's lowest MSE in comparative testing, deploy it as the primary engine for 1–5 day ahead price forecasting. Use RF as a secondary validator. Only act on trades where both models agree on direction.
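The agreement rule could be expressed roughly as follows. This is a sketch under stated assumptions: both models are taken to output a predicted next close, direction is judged against the last known close, and all function names are hypothetical.

```python
def direction(last_close, predicted_close):
    """Return +1 for a predicted rise, -1 for a fall, 0 for no change."""
    if predicted_close > last_close:
        return 1
    if predicted_close < last_close:
        return -1
    return 0


def consensus_signal(last_close, lstm_pred, rf_pred):
    """Emit a direction only when both models agree; otherwise 0 (stay out)."""
    d_lstm = direction(last_close, lstm_pred)
    d_rf = direction(last_close, rf_pred)
    return d_lstm if d_lstm == d_rf else 0


# Made-up prices, for illustration only.
print(consensus_signal(100.0, 103.0, 101.5))  # 1  (both predict a rise)
print(consensus_signal(100.0, 103.0, 99.0))   # 0  (models disagree)
print(consensus_signal(100.0, 98.0, 97.5))    # -1 (both predict a fall)
```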

🌿

Deploy Random Forest for Explainability

In regulated financial environments (for example, under SEBI oversight), model explainability is often expected. Random Forest provides SHAP-compatible feature importance, so use it wherever model decisions must be audited or explained to clients.

📊

Sector Rotation Strategy

NIFTY IT outperforms during tech booms; NIFTY Metal leads commodity cycles; Fin Services tracks rate cycles. Use the multi-index comparative model to allocate capital across sectors based on predicted momentum differentials.

⚠️

Integrate Volatility Filters

Do not use predictions in isolation during high-volatility events (budget days, RBI policy announcements, global crises). Implement an ATR (Average True Range) filter — pause automated signals when ATR exceeds 2× its 20-day average.
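One way to sketch such a filter is shown below. Several choices here are assumptions, not details from the source: the ATR is a simple moving average of the true range over 14 days (Wilder's smoothing is a common alternative), and the "20-day average" is taken over the 20 ATR values preceding the latest one. All names are hypothetical.

```python
def true_ranges(candles):
    """candles: list of (high, low, close) tuples in date order."""
    trs = []
    for i, (h, l, c) in enumerate(candles):
        if i == 0:
            trs.append(h - l)  # no previous close on the first day
        else:
            prev_close = candles[i - 1][2]
            trs.append(max(h - l, abs(h - prev_close), abs(l - prev_close)))
    return trs


def atr_series(candles, period=14):
    """Simple-moving-average ATR, one value per day once `period` days exist."""
    trs = true_ranges(candles)
    return [sum(trs[i - period + 1:i + 1]) / period
            for i in range(period - 1, len(trs))]


def should_pause(atr_values, window=20, multiple=2.0):
    """True when the latest ATR exceeds `multiple` x the prior `window`-day mean."""
    if len(atr_values) < window + 1:
        return False  # not enough history to judge
    baseline = sum(atr_values[-window - 1:-1]) / window
    return atr_values[-1] > multiple * baseline


# Made-up ATR histories, for illustration only.
print(should_pause([10.0] * 20 + [25.0]))  # True
print(should_pause([10.0] * 20 + [15.0]))  # False
```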

🔁

Retrain Monthly with Walk-Forward Validation

Market regimes shift. Retrain all models monthly on a rolling 2-year window. Use walk-forward validation (not random split) to prevent data leakage and ensure test performance reflects real-world generalization.
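A rolling walk-forward scheme can be sketched as an index generator. The function name is hypothetical. For the monthly retraining described above, one might pass roughly 500 rows for the 2-year window and about 21 rows per test month, assuming around 250 trading days a year.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train, test) index ranges with a rolling fixed-length window.

    Each test block follows its training block in time, and the window
    then slides forward by one test block.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


# Tiny example: 10 samples, train on 4, test on the next 2.
for train, test in walk_forward_splits(10, train_size=4, test_size=2):
    print(list(train), list(test))
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```

Unlike a random split, every test index is later than every training index in the same fold, so the model never sees future prices during training.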

💰

Backtest Before Live Deployment

Simulate 2 years of trades using model signals on historical data before going live. Track Sharpe ratio, max drawdown, and win rate. A model with good MSE can still lose money if the prediction timing is slightly off.
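The three backtest metrics named above can be computed from a list of per-trade (or per-day) returns and an equity curve. The sketch below uses hypothetical function names; it assumes daily returns annualised over 252 periods and a zero risk-free rate by default, and it needs at least two returns with non-zero variance for the Sharpe ratio.

```python
import math


def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualised Sharpe ratio from per-period returns (sample std dev)."""
    n = len(returns)
    excess = [r - risk_free / periods_per_year for r in returns]
    mean = sum(excess) / n
    variance = sum((r - mean) ** 2 for r in excess) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough fall, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def win_rate(returns):
    """Share of periods with a strictly positive return."""
    return sum(1 for r in returns if r > 0) / len(returns)


# Made-up values, for illustration only.
print(max_drawdown([100.0, 120.0, 90.0, 130.0]))  # 0.25
print(win_rate([0.01, -0.02, 0.03, 0.0]))         # 0.5
print(sharpe_ratio([0.01, -0.02, 0.03, 0.0]))
```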

🌐

Integrate yfinance for Live Data

The yfinance library is already in requirements.txt. Configure it to auto-fetch latest NSE NIFTY data daily at 4pm IST (after market close) and append to the training dataset for continuous model freshness.

🛡️

Add Fundamental Overlay

Pure OHLCV models miss earnings releases, RBI rate decisions, and global macro events. Augment with a simple event calendar filter — suppress or weight down model signals on known high-impact event dates.

📋 Risk Disclaimer

This project is for educational and research purposes only. All predictions are based on historical data and statistical models. Past performance does not guarantee future results. Stock market investments involve risk including potential loss of principal. Always conduct your own due diligence before making any investment decisions. This system does not account for macroeconomic factors, geopolitical events, or company-specific news.

Future Enhancements

Roadmap to transform this portfolio project into a production-grade real-time trading intelligence system.

Phase 1 — Authentication Layer
🔐 User Login & Multi-Tenancy
Add Flask-Login / JWT authentication so multiple analysts can log in with personal dashboards. Each user's prediction history and preferred algorithms are stored in a SQLite or PostgreSQL database. After 30+ inputs, the system calculates which algorithm gave the most accurate predictions for that user's selected stocks and automatically highlights the recommended model.
Flask-Login · SQLite · Per-user analytics
Phase 2 — Adaptive Algorithm Selection
🧮 Auto-Best Algorithm Engine
Track prediction accuracy per algorithm per user over time. After collecting 30+ prediction-vs-actual pairs, compute a rolling 30-day accuracy score for each model. The system automatically promotes the best-performing algorithm to "Recommended" status and demotes underperformers. Users can override manually.
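A possible shape for this tracker is sketched below. It is an illustration under stated assumptions, not the planned implementation: accuracy is measured as rolling MSE over the most recent 30 prediction-vs-actual pairs (rather than 30 calendar days), and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class AlgorithmTracker:
    """Keep a rolling window of squared errors per algorithm."""

    def __init__(self, window=30, min_pairs=30):
        self.min_pairs = min_pairs
        self.errors = defaultdict(lambda: deque(maxlen=window))

    def record(self, algorithm, predicted, actual):
        """Store one prediction-vs-actual pair as a squared error."""
        self.errors[algorithm].append((predicted - actual) ** 2)

    def recommended(self):
        """Return the algorithm with the lowest rolling MSE, or None."""
        eligible = {name: sum(errs) / len(errs)
                    for name, errs in self.errors.items()
                    if len(errs) >= self.min_pairs}
        return min(eligible, key=eligible.get) if eligible else None


# Small window so the example stays short; values are made up.
tracker = AlgorithmTracker(window=3, min_pairs=3)
for actual in (100.0, 101.0, 102.0):
    tracker.record("LSTM", actual + 1.0, actual)  # squared error 1.0
    tracker.record("RF", actual + 2.0, actual)    # squared error 4.0

print(tracker.recommended())  # LSTM
```

Because the deque has a fixed maximum length, old errors fall out automatically and the recommendation follows recent performance.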
Rolling MSE tracker · Auto-recommend
Phase 3 — Real-Time Data Pipeline
📡 Live Data Collection & Processing
How to collect real-time data: Use the yfinance library (already in requirements.txt) to fetch live NSE data. Schedule a Python cron job (using APScheduler or Celery) to run at 4:00 PM IST daily after market close:

import yfinance as yf
ticker = yf.Ticker("^CNXIT"); df = ticker.history(period="1d")
How to process: Append new OHLCV row to the training dataset. Re-run feature engineering (polynomial expansion, temporal decomposition). Trigger incremental retraining of RF and LSTM models. Update the serialised .pkl files. Push new predictions to the dashboard via WebSocket (Flask-SocketIO).

Tech stack: APScheduler (job scheduling) → yfinance (data fetch) → pandas (transformation) → joblib (model update) → Flask-SocketIO (live push to UI).
yfinance · APScheduler · Flask-SocketIO · Auto-retrain
Phase 4 — Technical Indicators
📈 Advanced Feature Engineering
The ta library is already in requirements.txt but unused. Add RSI(14), MACD, Bollinger Bands, ATR, OBV, and MFI as model features. These indicators encode market momentum, volatility, and volume pressure, which may improve prediction accuracy over OHLCV alone; the gain should be confirmed on held-out data.
RSI · MACD · Bollinger · ATR · OBV · MFI
Phase 5 — NLP & Sentiment
💬 News Sentiment Analysis
Integrate a news API (NewsAPI.org or NSE's official feed) to fetch financial headlines for each index. Run FinBERT (a finance-specific BERT model) for sentiment classification (Positive/Neutral/Negative). Add the sentiment score as an additional feature to the ML models. Some studies report that news sentiment improves short-term prediction accuracy by 8–15%.
FinBERT · NewsAPI · +8–15% accuracy
Phase 6 — Production Deployment
🚀 Cloud & API Deployment
Containerise the application with Docker. Deploy on AWS EC2 or Azure App Service. Expose predictions as a REST API using FastAPI so external applications (trading platforms, mobile apps) can consume predictions programmatically. Add a CI/CD pipeline with GitHub Actions for automatic testing and deployment on every code push.
Docker · FastAPI · AWS/Azure · GitHub Actions CI/CD

Get In Touch

Built by Pavan Kumar Nallabothula — Data Analyst specializing in ML, forecasting, and financial data systems.

👤 About the Author

Professional Data Analyst with expertise in machine learning, predictive modelling, and financial data systems. This project is part of a portfolio demonstrating end-to-end ML pipeline development — from raw data ingestion through feature engineering, multi-model training, evaluation, and web deployment.

Python · Machine Learning · Deep Learning (LSTM) · Financial Analytics · Flask / Web Dev · NSE / Indian Markets