Linear Regression is one of the most fundamental methods utilised within data science, with applications in both prediction and inference. Many practising data scientists have a strong grounding in statistics, and linear regression will be extremely familiar to this group. However, there are those who are either self-taught, have been trained at an intensive code-focused bootcamp or have a background in computer science rather than mathematics or statistics.
For this latter group, linear regression may not have been considered in depth. It may have been taught in a manner that emphasises prediction, without delving into the specifics of estimation, inference or even the proper applicability of the technique to a particular dataset.
This article series is designed to 'fill the gap' for those who do not have formal training in statistical methods. It will discuss linear regression from the 'ground up', outlining when it should be used, how the model is fitted to data, the goodness of such a fit as well as diagnosis of problems that may lead to bias within the results.
Such theoretical insights are not simply 'nice to haves' for practising data scientists. Many interview questions at some of the top data-driven employers will test advanced knowledge of the technique in order to differentiate between those data scientists who may have briefly dabbled in Scikit-Learn and those who have extensive experience in statistical data analysis.
Having a solid grounding in linear regression will also provide greater intuition as to when it is appropriate to apply a particular model to a dataset. This will ultimately lead to more robust analyses and better outcomes for your data science objectives.
In this overview article we will briefly discuss the mathematical model of linear regression. We will then provide a roadmap for the subsequent series of articles, each of which goes into more depth on a particular aspect. We will also describe the software we will be using to strengthen our knowledge of the technique.
Mathematically, the linear regression model states that a particular continuous response value y is given by a linear combination of the feature values plus a normally distributed error term:

y = β^T x + ε

Where the parameters are given by β = (β_0, β_1, …, β_p)^T and the feature vector by x = (1, x_1, …, x_p)^T.

Note that β is (p + 1)-dimensional. This is due to the fact that we need to include p parameters plus an intercept term in the model.

Including the '1' as the first component of x allows the intercept β_0 to be absorbed into the vector notation, rather than being written as a separate additive term.

Stacking all n samples together gives the matrix form of the model:

y = Xβ + ε

Where y is the n-dimensional vector of responses, X is the n × (p + 1) design matrix, β is the (p + 1)-dimensional parameter vector and ε is an n-dimensional vector of errors.
Informally this states that the vector of response values is equal to a matrix multiplication of the parameters with the matrix of features (n rows, one per sample, with p + 1 features per row) plus a vector of normally distributed errors.
The linear regression model thus attempts to explain the n-dimensional response vector with a much simpler p + 1-dimensional linear model, leaving n - (p + 1)-dimensional random variation in the residuals of the model.
Essentially, the model is trying to capture as much of the structure of the data as possible in p + 1 dimensions, where p + 1 is usually far smaller than the number of samples n.
The task of linear regression is to try and find an appropriate estimate of the parameter vector β from the observed data. The most common fitting procedure is known as ordinary least squares (OLS), which chooses the estimate that minimises the sum of squared residuals.
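The matrix formulation above can be illustrated with a short simulation. The following is a minimal sketch using NumPy (the dataset and all variable names are our own invention, for illustration only): we build the design matrix with its column of ones, generate responses via y = Xβ + ε, and recover the OLS estimate by solving the normal equations (X^T X) β̂ = X^T y.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 2  # n samples, p features
true_beta = np.array([1.5, -2.0, 0.7])  # intercept plus p slope parameters

# Build the n x (p + 1) design matrix: a column of ones for the
# intercept, followed by the feature columns
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])

# Generate responses according to the model y = X beta + eps
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# OLS estimate via the normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should lie close to true_beta
```

In practice a library routine (or a numerically stabler solver such as `np.linalg.lstsq`) would be used rather than forming X^T X explicitly, but the normal equations make the estimation procedure transparent.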
The roadmap below will describe OLS in detail along with some alternative fitting procedures. It will also include some of the issues that can arise when trying to apply linear regression to real world datasets.
Now that we have introduced linear regression we are going to outline how we will proceed in subsequent articles:
As more articles are published on QuantStart they will be added to this roadmap here.
In this series of articles we will make use of the Python programming language and its range of popular, freely-available open source data science libraries. We will assume that you have a working Python research environment set up. The most common and straightforward approach is to install the free Anaconda distribution.
We will be making use of Python libraries including Scikit-Learn and Statsmodels:
Scikit-Learn is often the 'go-to' machine learning library with its own implementation of ordinary least squares regression. However we wish to emphasise the theoretical properties and statistical inference of linear regression and hence will mainly be utilising the implementation found in Statsmodels. Those who are familiar with fitting linear models in R will likely find Statsmodels a similar environment.
We have described how the variety of data science backgrounds leads to wide variation in understanding of the theoretical basis of many statistical models.
We have motivated why learning linear regression is extremely useful, both from the point of view of interview preparation and for producing robust outcomes in data science projects.
Linear regression was briefly introduced with an accompanying learning roadmap. Finally, the software utilised in future articles was described.
Reference: Linear Regression: An Introduction, QuantStart. https://www.quantstart.com/articles/linear-regression-an-introduction/