Rajandran R: Telecom Engineer turned full-time Derivative Trader. Mostly trading Nifty, Banknifty, USDINR and highly liquid stock derivatives. Trading the markets since 2006. Using Market Profile and Orderflow for more than a decade. Designed and published 100+ open source trading systems on various trading tools. Strongly believes that market understanding and robust trading frameworks are the key to trading success. Writing about Markets, Trading System Design, Market Sentiment, Trading Software & Trading Nuances since 2007. Author of Marketcalls.in and Co-Creator of Algomojo (Algorithmic Trading Platform for DIY Traders)


Machine learning has become an integral part of stock market analysis and prediction. Linear Regression is a widely used algorithm for predicting stock prices. In this blog, we will discuss the Linear Regression model for predicting stock prices using the Python programming language.

## What is Linear Regression

**Linear regression is a type of supervised learning algorithm** that makes predictions based on a linear relationship between the input variables (also known as features) and the output variable (also known as the target variable).
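Written out for $n$ input features, the linear relationship takes the form below. The symbols $\hat{y}$ (prediction), $\beta_0$ (intercept), $\beta_j$ (coefficients), $x_j$ (features) and $m$ (number of training rows) are notation introduced here for illustration, not taken from the original post. Ordinary least squares, which is what scikit-learn's `LinearRegression` fits, chooses the coefficients that minimize the sum of squared residuals:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

$$\min_{\beta_0, \dots, \beta_n} \; \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$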

In the case of stock price prediction, the linear regression model is trained on historical price data, which includes **features such as the open, high, low, close, volume, RSI, EMA, HMA, ADX, and ATR for a given day**. The **target variable is `close_forecast`**, the closing price to be forecast, since this is the price that investors are most interested in predicting.

The linear regression model uses the training data to learn the relationship between the input variables (features) and the target variable (the value to predict) by estimating the coefficients of the linear equation that best fits the data. Once the model has been trained, it can make predictions on new, unseen data by plugging the values of the input variables into that equation with the learned coefficients.
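To make the "estimate coefficients, then plug in new values" idea concrete, here is a minimal sketch using NumPy's least-squares solver. The toy numbers and the `predict` helper are hypothetical and chosen only so the learned coefficients are easy to check by hand; this is not the model used in this post.

```python
import numpy as np

# Toy feature matrix: 5 samples, 2 features (hypothetical values).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Target generated from known coefficients: y = 3 + 2*x1 - 1*x2
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients that minimize the squared residuals.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]


def predict(x_new):
    """Plug new feature values into the learned linear equation."""
    return intercept + np.dot(x_new, slopes)


pred = predict(np.array([6.0, 2.0]))
print("intercept:", intercept, "slopes:", slopes, "prediction:", pred)
```

Because the toy target is an exact linear function of the features, the solver recovers the generating coefficients; with real market data the fit is only approximate.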

**Preparing the Feature Dataset**

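The dataset used for the implementation of the model is the **NIFTY_EOD.csv** file, which contains end-of-day **open, high, low, volume, previous close, RSI, EMA, HMA, ADX, PDI, MDI, and ATR values**. It also contains the `Ticker` and `Date/Time` columns, which are dropped before training, and the `close_forecast` column, which is used as the target.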
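The dataset in this post was prepared in Amibroker, and the exact definition of `close_forecast` is not shown. As a rough sketch, assuming the target is simply the next session's close, such a column could be built in pandas as follows. The prices and dates are made up for illustration.

```python
import pandas as pd

# Hypothetical end-of-day closes; real data would come from NIFTY_EOD.csv.
df = pd.DataFrame({
    "Date/Time": pd.date_range("2023-01-02", periods=5, freq="B"),
    "close": [18000.0, 18050.0, 17990.0, 18100.0, 18120.0],
})

# ASSUMPTION: the target is the next session's close,
# i.e. the close column shifted up by one row.
df["close_forecast"] = df["close"].shift(-1)

# The final row has no next-day close yet, so it cannot be used for training.
train_rows = df.dropna(subset=["close_forecast"])
print(train_rows)
```

Under this assumption the most recent row is kept aside for forecasting rather than training, because its target is still unknown.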

Download the Linear Regression Features Dataset prepared using Amibroker

Download NIFTY EOD csv data set

The dataset is split into two parts, training data, and testing data. The training data is used to train the model, and the testing data is used to evaluate the performance of the model.
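The code in this post uses a random 80/20 split. For time-ordered price data, a chronological split is a common alternative, because random shuffling places rows from later dates in the training set. The sketch below, on hypothetical toy arrays, shows how `train_test_split` keeps row order when `shuffle=False` is passed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: 10 rows, oldest first.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# shuffle=False keeps the row order, so the last 20% of rows form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print("train targets:", y_train)
print("test targets:", y_test)
```

With this split the model is always evaluated on dates that come after everything it was trained on, which is closer to how a forecast would be used in practice.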

The Linear Regression model is created by instantiating the LinearRegression class from the scikit-learn library. The model is then trained on the training data by calling its fit() method.

**Python Source code for Linear Regression Based Machine Learning Prediction**

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
import numpy as np

# Load the data
stock = pd.read_csv('NIFTY_EOD.csv')
df = stock

# Prepare the data
y = df['close_forecast']  # target
X = df.drop(columns=['Ticker', 'Date/Time', 'close_forecast'], axis=1)  # feature input
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Compute accuracy metrics
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)

# Save predicted vs actual values to CSV
df_pred = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_pred.to_csv('NIFTY_EOD_pred.csv', index=False)

# Compute accuracy metrics
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
r2 = r2_score(y_test, y_pred)
ev = explained_variance_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_test - y_pred))

# Print accuracy metrics
print("Mean absolute percentage error (MAPE): ", mape)
print("R-squared: ", r2)
print("Explained variance: ", ev)
print("Mean squared error: ", mse)
print("Root mean squared error (RMSE): ", rmse)
print("Mean absolute error (MAE): ", mae)

# Make a prediction for the next day's close price
last_row = X.tail(1)
last_row_scaled = scaler.transform(last_row)
next_day_pred = model.predict(last_row_scaled)[0]
print("Predicted close price for the next day: ", next_day_pred)
```

**Python Output**

```
Mean squared error: 5131.098941109195
Mean absolute percentage error (MAPE): 1.0399683400963986
R-squared: 0.9997643613546477
Explained variance: 0.9997647348779873
Mean squared error: 5131.098941109195
Root mean squared error (RMSE): 71.63168950338387
Mean absolute error (MAE): 42.91492045843861
Predicted close price for the next day: 17400.060390626983
```

Here are the common steps involved in linear regression prediction:

1. **Data Collection**: Collect relevant data related to the problem statement. In the case of stock price prediction, data such as historical stock prices, volumes, market trends, etc., are collected.
2. **Data Preprocessing**: This step involves cleaning and preparing the data for analysis. It includes removing any missing values, handling outliers, scaling/normalizing the data, etc.
3. **Feature Selection**: Identifying the features that are most relevant to the problem statement. In the case of stock price prediction, features such as open price, close price, volume, etc., may be considered.
4. **Training the Model**: This involves selecting a machine learning algorithm, such as linear regression, and training the model on the prepared data. During this step, the model learns to make predictions based on the patterns found in the data.
5. **Model Evaluation**: This step involves evaluating the performance of the model on a separate set of data (testing data) that was not used during training. Common evaluation metrics include mean squared error, mean absolute error, and R-squared.
6. **Hyperparameter Tuning**: The performance of some models can be improved by tuning the hyperparameters of the algorithm. Plain ordinary least squares, as fitted by LinearRegression, has no learning rate or iteration count to tune; regularized variants such as Ridge or Lasso add a regularization strength that can be tuned.
7. **Prediction**: Once the model is trained and evaluated, it can be used to make predictions on new, unseen data.
8. **Model Deployment**: The final step involves deploying the trained model into a production environment for use in real-world applications.
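As an illustration of the tuning step, the sketch below swaps in Ridge regression, a regularized variant of linear regression, because scikit-learn's `LinearRegression` has no regularization strength to tune. The synthetic data, the `alpha` grid and the choice of `TimeSeriesSplit` for cross-validation are all assumptions made for this example, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table: 200 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Scaling and the model live in one pipeline, so the scaler is
# re-fitted on the training part of every cross-validation fold.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Search over the regularization strength with time-ordered folds.
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("Best alpha:", search.best_params_["ridge__alpha"])
```

Putting the scaler inside the pipeline is a design choice that keeps test-fold statistics out of the training step during the search.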

After training the model, we can make predictions on the testing data using the predict() method of the LinearRegression class. **The accuracy of the model is evaluated using different metrics, such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared, and Explained Variance.**

In the above Python code, the **metrics of the Linear Regression model** are calculated as follows:

- **Mean squared error (MSE):** This metric measures the average of the squared differences between the predicted and actual values. In this case, the MSE is 5131.10. Because the errors are squared, the MSE is expressed in squared price units, so it is not an average distance from the actual values; the RMSE is easier to read on the price scale.
- **Mean absolute percentage error (MAPE):** This metric measures the average percentage difference between the predicted and actual values. A value of 1.04 indicates that, on average, the model's predictions are about 1.04% away from the actual values.
- **R-squared (R2):** This metric measures the proportion of the variance in the dependent variable (stock price) that is explained by the independent variables (predictors) in the model. An R2 value of 0.99976 indicates that 99.98% of the variance in the stock price can be explained by the predictors in the model. Price levels are highly persistent from one day to the next, so a value this close to 1 is expected when the target is a price level and the features include the same day's prices.
- **Explained variance:** This metric is similar to R2 in that it measures the proportion of variance in the dependent variable that is explained by the model; the two differ only when the prediction errors have a non-zero mean. An explained variance value of 0.99976 indicates that the model explains 99.98% of the variance in the stock price.
- **Root mean squared error (RMSE):** This metric is the square root of the MSE, which brings the error back to price units. In this case, the RMSE is 71.63, which indicates a typical prediction error of about 71.63 points, with large errors weighted more heavily than in the MAE.
- **Mean absolute error (MAE):** This metric measures the average absolute difference between the predicted and actual values. In this case, the MAE is 42.91, which indicates that the model's predictions are, on average, 42.91 points away from the actual values.
- **Predicted close price for the next day:** This is the prediction made by the model for the next day's close, based on the most recent row of features. In this case, the predicted close price is 17400.06.
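To see how these metrics relate to each other, here is a small worked example on three hypothetical actual/predicted pairs, computed by hand with NumPy and cross-checked against scikit-learn's `r2_score`. The numbers are invented for illustration and are unrelated to the NIFTY results reported in this post.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 190.0, 380.0])

errors = actual - predicted            # [-10, 10, 20]

mse = np.mean(errors ** 2)             # (100 + 100 + 400) / 3 = 200, in squared units
rmse = np.sqrt(mse)                    # square root of 200, back in the original units
mae = np.mean(np.abs(errors))          # (10 + 10 + 20) / 3
mape = np.mean(np.abs(errors / actual)) * 100  # (10% + 5% + 5%) / 3

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae, "MAPE:", mape, "R2:", r2)
print("R2 from scikit-learn:", r2_score(actual, predicted))
```

Note that the MSE (200) is far larger than any individual error, while the RMSE and MAE sit on the same scale as the errors themselves.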

One **disadvantage of the linear regression model** is that it assumes a linear relationship between the input variables and the target variable. In reality, the relationship may be nonlinear, which can lead to poor performance of the linear regression model. Additionally, linear regression is sensitive to outliers in the data, which can skew the learned coefficients and lead to poor predictions. Finally, the coefficients are most stable and interpretable when the input variables are not strongly correlated with each other, which is often not the case in practice: features such as the open, high, low, and moving averages tend to move together, and this multicollinearity can make individual coefficients unreliable.
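One quick way to check how strongly the inputs move together is to look at their correlation matrix. The sketch below builds a synthetic trending close series and two exponential moving averages of it; the series, the column names `ema_5` and `ema_20`, and the spans are hypothetical and only stand in for the kind of price-derived features discussed in this post.

```python
import numpy as np
import pandas as pd

# Synthetic trending "close" series with noise (not real NIFTY data).
rng = np.random.default_rng(42)
t = np.arange(300)
close = pd.Series(17000 + 5.0 * t + rng.normal(scale=50.0, size=300), name="close")

# Two price-derived features: exponential moving averages of the close.
features = pd.DataFrame({
    "close": close,
    "ema_5": close.ewm(span=5, adjust=False).mean(),
    "ema_20": close.ewm(span=20, adjust=False).mean(),
})

# Pairwise Pearson correlations; values near 1 signal multicollinearity.
corr = features.corr()
print(corr.round(3))
```

When pairs of features show correlations this close to 1, dropping some of them or using a regularized model are common ways to stabilize the coefficients.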

In conclusion, the Linear Regression model is a powerful machine-learning algorithm for predicting stock prices. The above Python code shows how to implement the Linear Regression model for predicting stock prices and evaluating its performance using various metrics. The accuracy of the model can be improved by using more relevant features and tuning the hyperparameters.