Machine Learning for Finance I – All you need to know about Linear Regression
Theory and Python example!
Today’s article kicks off the Machine Learning for Finance – All you need to know series, which will take a deep dive into various machine learning algorithms in the context of quantitative finance.
Machine Learning is at the frontier of the technological revolution going on in the finance world right now. Skilled data scientists are in high demand across our industry, and not without reason. A successful machine learning algorithm can create complicated, dynamically changing models that are able to spot patterns that are simply not trackable by humble human beings. Having that power in your skillset will undoubtedly open a lot of doors and take your investment analysis to the next level!
Linear Regression is considered the simplest and most widely used machine learning algorithm available at the moment. However, despite its simplicity, it can be very powerful and accurate if executed properly, as you will see in the example later in the article.
What is Linear Regression
Let’s start with the definition and purpose. Linear regression is used to predict linear trends in data. As with all machine learning algorithms, much of the significance lies in the features that are fed into the model. Linear regression can be described as finding the line of best fit between the predicted results and the actual data, and measuring its accuracy.
Linear Regression can be described by the formula: Y = b1*X1 + b2*X2 + C + e
Where X1 and X2 are independent variables (moving averages in our case), while b1 and b2 are slopes, i.e. indicators of how much the dependent variable (Y) changes when the corresponding X changes. There is also C, which stands for the constant value or intercept, and e, which is the error term.
To clarify, have a look at the example trend line fitted in the sketch below.
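Here is a minimal sketch (with illustrative data, not from the article) showing how such a trend line can be fitted in Python with NumPy:

import numpy as np

np.random.seed(42)
x = np.arange(50)                               # independent variable
y = 2.0 * x + 5.0 + np.random.normal(0, 4, 50)  # noisy linear data

slope, intercept = np.polyfit(x, y, deg=1)      # line of best fit
trendline = slope * x + intercept
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")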
The key thing we have not mentioned yet is the error. The error is the mismatch between the actual value and the trendline. To aggregate it, we use the Mean Squared Error (MSE), which does exactly what it sounds like: it takes the average of the errors (i.e. the distances of the actual values from the trend line) raised to the power of two (to cut off the negative values).
This can be expressed by the formula:
MSE = (1/N) * Σ (yi – (b1*xi + b0))²
Where N is the number of samples, yi is the actual value and (b1*xi + b0) is the predicted value.
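As a quick self-contained illustration (the actual and predicted values below are made up for demonstration), the MSE can be computed in Python like this:

import numpy as np

y_actual = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # actual values
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])    # values predicted by the trend line (b1*xi + b0)

mse = np.mean((y_actual - y_pred) ** 2)          # average of squared errors
print(f"MSE = {mse:.4f}")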
Cost Function for Linear Regression – Gradient Descent
That’s the point where Linear Regression does its thing. Once a basic trendline is established, the linear regression algorithm uses Gradient Descent to minimize the cost function: the distance from the trendline to the actual values, measured by the previously mentioned Mean Squared Error (MSE). The algorithm tries to find the values of b1, b2, … (the coefficients of the independent variables in the top formula) that reduce the MSE as much as possible.
The algorithm does this by starting from initial (often random) coefficient values and repeatedly adjusting them in the direction that lowers the MSE. Here the concept of high and low learning rates comes into the picture. In gradient descent, the learning rate is the size of each adjustment step, so it decides how fast the algorithm converges. With a low learning rate, the differences between consecutive tries are small and the algorithm, although slowly, approaches the destined value without overshooting. The opposite happens with a high learning rate, where the differences between consecutive tries are significant, which increases the risk of overshooting but can reach the optimal value in a shorter time.
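To make the idea concrete, here is a minimal sketch of gradient descent for a single-feature linear regression; the toy data, learning rate and iteration count are illustrative assumptions:

import numpy as np

# Toy data that roughly follows y = 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

b1, b0 = 0.0, 0.0     # start from arbitrary initial coefficients
learning_rate = 0.05  # step size: too high risks overshooting, too low converges slowly
n = len(x)

for _ in range(2000):
    y_pred = b1 * x + b0
    # Gradients of the MSE with respect to b1 and b0
    grad_b1 = (-2.0 / n) * np.sum(x * (y - y_pred))
    grad_b0 = (-2.0 / n) * np.sum(y - y_pred)
    b1 -= learning_rate * grad_b1
    b0 -= learning_rate * grad_b0

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")  # should land close to 2 and 1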
Evaluation Metrics
Once the model is ready, there are two accuracy measures to be applied to assess the so-called goodness of fit of the model.
R-squared (Coefficient of determination)
R-squared offsets the Residual Sum of Squares (RSS), produced by the regression model, against the Total Sum of Squares (TSS), measured against the average line. Therefore, R-squared indicates how much better the linear regression fits the data than the plain average line would.
R-squared formula:
R² = 1 – (RSS / TSS)
Formula for RSS:
RSS = Σ (yi – ŷi)²
Therefore, in other words, we may describe RSS as the sum of squared differences between each actual point yi and our estimate ŷi.
Formula for TSS:
TSS = Σ (yi – ȳ)²
From this equation you may observe that TSS is the sum of all squared differences between the actual values and their average ȳ (the average line).
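Putting RSS, TSS and R-squared together, a small self-contained Python illustration (with made-up actual and predicted values) could look like this:

import numpy as np

y_actual = np.array([3.1, 4.9, 7.2, 8.8, 11.1])    # observed values
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])      # regression predictions

rss = np.sum((y_actual - y_pred) ** 2)             # Residual Sum of Squares
tss = np.sum((y_actual - np.mean(y_actual)) ** 2)  # Total Sum of Squares (vs the average)
r_squared = 1 - rss / tss
print(f"RSS = {rss:.4f}, TSS = {tss:.4f}, R^2 = {r_squared:.4f}")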
Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)
These methods are similar to the above; however, RMSE takes the square root of the MSE, which brings the error back to the units of the dependent variable:
RMSE = √MSE = √( (1/N) * Σ (yi – ŷi)² )
In order to make the calculation relevant and unbiased, RSS needs to be divided by the degrees of freedom rather than the number of observations, which creates the RSE:
RSE = √( RSS / (n – p – 1) )
Where n is the number of observations and p is the number of predictors (so n – 2 for a single-feature model).
However, R-squared is considered superior to RSE due to the fact that the latter is not normalized: RSE is expressed in the units of the dependent variable, so its value changes with the scale and size of the data, making it harder to perform relative analysis of any sort.
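For completeness, here is a short sketch computing RMSE and RSE on the same kind of made-up data; setting p (the number of predictors) to 1 is an assumption for a single-feature model:

import numpy as np

y_actual = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
p = 1                                              # number of predictors in the model

rss = np.sum((y_actual - y_pred) ** 2)
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))  # square root of the MSE
rse = np.sqrt(rss / (len(y_actual) - p - 1))       # RSS divided by degrees of freedom
print(f"RMSE = {rmse:.4f}, RSE = {rse:.4f}")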
Linear Regression Assumptions:
1. X and Y linearity: The first rule is that the independent variables, i.e. the features of the model, need to be linearly related to the dependent variable. That is, as the value of X rises, the value of Y consistently rises or falls with it. For example, if the temperature increases, ice cream consumption also increases.
2. Independence of residuals: This one concerns the errors of the observations. There cannot be any visible pattern affecting the mismatches between the trendline and the actual results; such a pattern is called autocorrelation.
3. Residuals are normally distributed: Another one regarding the errors. A normal distribution shapes the errors in such a way that small mismatches naturally appear more often, while the larger the mismatch or error, the less frequently it appears.
4. Variance of residuals being equal: Simply put, the error variance needs to remain constant, which is referred to in statistics as homoscedasticity. A quick way to test these residual assumptions is sketched below.
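Here is a sketch of how those residual assumptions can be tested with standard scipy and statsmodels diagnostics; the data is illustrative and the p-value thresholds in the comments are the usual rules of thumb, not hard rules:

import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative linear data with random noise
np.random.seed(0)
x = np.arange(100, dtype=float)
y = 2.0 * x + 5.0 + np.random.normal(0, 3, 100)

X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()
residuals = model.resid

dw = durbin_watson(residuals)                   # ~2 suggests no autocorrelation (assumption 2)
_, shapiro_p = shapiro(residuals)               # p > 0.05 suggests normal residuals (assumption 3)
_, bp_p, _, _ = het_breuschpagan(residuals, X)  # p > 0.05 suggests homoscedasticity (assumption 4)

print(f"Durbin-Watson: {dw:.2f}, Shapiro p: {shapiro_p:.3f}, Breusch-Pagan p: {bp_p:.3f}")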
Practical example - Linear Regression in algorithmic trading, using moving averages to predict gold prices
Once you have familiarized yourself with the essential information about Linear Regression, check out my GitHub repository, where I run a gold price prediction using Linear Regression, finally achieving a goodness of fit equal to 94.14%! Remember that this model does not include transaction costs, and it is advised to be used for learning and relative analysis purposes only.
Here’s link to my GitHub with Gold Price predicting Machine Learning Model in Python:
Gold price prediction using Linear Regression
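For readers who want a taste of the approach before opening the repository, below is a minimal sketch of the general idea (not the exact repository code): moving averages of the gold price serve as features for a scikit-learn LinearRegression model. The file name gold_prices.csv, the column name Gold and the window lengths are illustrative assumptions:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV with a date index and a 'Gold' price column
df = pd.read_csv("gold_prices.csv", index_col=0, parse_dates=True)

df["ma_3"] = df["Gold"].rolling(window=3).mean()  # short moving average
df["ma_9"] = df["Gold"].rolling(window=9).mean()  # long moving average
df["target"] = df["Gold"].shift(-1)               # next day's price as the dependent variable
df = df.dropna()

split = int(0.8 * len(df))                        # time-ordered train/test split
X_train, y_train = df[["ma_3", "ma_9"]][:split], df["target"][:split]
X_test, y_test = df[["ma_3", "ma_9"]][split:], df["target"][split:]

model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on test data: {model.score(X_test, y_test):.4f}")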
And there you have it: the simplest yet very powerful machine learning model, Linear Regression, covered in both theory and practice.
I hope you enjoyed the content and are looking forward to the next article, where we will take a deep dive into the next machine learning algorithm.
Stay Tuned!
Tomasz