Typically, when a researcher wants to determine the linear relationship between a target and one or more predictors, the first method that comes to mind is the linear regression model.
Linear regression analyses whether one or more predictor variables explain the dependent variable. In the simplest case, one variable is treated as explanatory and the other as dependent. The linear regression line is represented by Y = a + bX, where ‘Y’ is the dependent variable, ‘X’ is the explanatory variable, ‘a’ is the intercept and ‘b’ is the slope.
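To make the intercept and slope concrete, here is a minimal Python sketch of an ordinary least squares fit of Y = a + bX. The function name `fit_line` and the five data points are illustrative assumptions, not taken from any library or real data set.

```python
from statistics import mean


def fit_line(x, y):
    """Return (a, b), the intercept and slope that minimise the
    sum of squared errors of y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Made-up points that lie exactly on the line Y = 1 + 2X
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
a, b = fit_line(x, y)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

Because the made-up points sit exactly on a straight line, the fit recovers the intercept and slope used to generate them. With real data the points scatter around the line and the residuals are no longer zero.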
For instance, a researcher might want to relate the heights of individuals to their weights using this model. Before fitting a linear model to observed data, the researcher must investigate whether there is a relationship between the variables of interest. A scatterplot is the usual tool for this. If no association between the explanatory and dependent variables exists, then fitting a linear regression model to the data will not deliver a useful model.
The numerical measure of linear association between two variables is known as the correlation coefficient. Its value lies between -1 and 1: values near either extreme indicate a strong linear association, and values near 0 indicate a weak one.
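As a rough illustration of the correlation coefficient, the sketch below computes Pearson's r from its definition. The helper name `pearson_r` and the height and weight figures are invented for the example and are not real measurements.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg), made up for illustration
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 85]
r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 suggests that a straight line is a reasonable first model. A value close to 0 suggests that a linear fit is unlikely to be useful.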
The linear regression model has five key assumptions:
- Linear relationship between the independent and dependent variables
- Statistical independence of errors (no correlation between consecutive errors, particularly in time series data)
- Homoscedasticity of errors
- Normality of error distribution
- No or little multicollinearity
If any of these assumptions is violated, the scientific insights and forecasts yielded by the model may be inefficient, biased or misleading. It is therefore essential to diagnose violations and apply the right fix.
- Violation of linearity
Diagnosis – Non-linearity is usually evident in a plot of residuals vs predicted values or of observed vs predicted values. The points should be symmetrically distributed around a horizontal line in the former plot and around a diagonal line in the latter. Look carefully for a ‘bowed’ pattern, which indicates that the model makes systematic errors whenever it makes unusually large or small predictions.
Solution – The usual fix is to apply a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, the log transformation is an option. Applying a log transformation to the dependent variable alone is equivalent to assuming that it grows or decays exponentially as a function of the independent variables. Applying it to both the dependent and the independent variables is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive in their original units. This means that a small percentage change in any one of the independent variables results in a proportional percentage change in the expected value of the dependent variable.
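The bowed residual pattern and the effect of a log transformation can be sketched in a few lines of Python. The data below are synthetic, generated from an exponential curve chosen purely for illustration, and `fit_line` and `residuals` are hypothetical helper names.

```python
from math import exp, log
from statistics import mean


def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals(x, y):
    """Observed minus fitted values for a straight-line fit."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Synthetic data: y grows exponentially in x (an assumption for this sketch)
x = [0, 1, 2, 3, 4, 5, 6]
y = [2.0 * exp(0.5 * xi) for xi in x]

raw_resid = residuals(x, y)                        # straight line on raw y
log_resid = residuals(x, [log(yi) for yi in y])    # straight line on log(y)

print("raw:", [round(r, 2) for r in raw_resid])
print("log:", [round(r, 2) for r in log_resid])
```

On the raw scale the residuals are positive at both ends and negative in the middle, which is the bowed pattern described above. After taking logs of the dependent variable the relationship is exactly linear for this synthetic curve, so the residuals vanish up to rounding error.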
- Violation of independence
Diagnosis – Inspect a residual time series plot (residuals vs row number) and a plot of the residual autocorrelations. Most residual autocorrelations should fall within the 95% confidence bands around zero, which lie at roughly plus-or-minus 2 divided by the square root of n, where n is the sample size. Pay particular attention to significant correlations at the first few lags and in the vicinity of the seasonal period, as these are fixable.
Solution – You can add lags of the dependent variable and/or lags of the independent variables. Alternatively, if you have an ARIMA+regressor procedure, add an AR(1) or MA(1) term to the regression model. An AR(1) term adds a lag of the dependent variable, whereas an MA(1) term adds a lag of the forecast error. If there is seasonality in the data, it can be handled in either of two ways: (i) seasonally adjust the variables or (ii) include seasonal dummy variables in the model.
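The autocorrelation check can be sketched as follows. The approximate 95% band of plus-or-minus 2 divided by the square root of n is a common rule of thumb, the function names are hypothetical, and the residual series is invented to show a repeating pattern with a period of eight observations.

```python
from math import sqrt
from statistics import mean


def autocorr(resid, lag):
    """Sample autocorrelation of a residual series at the given lag."""
    n = len(resid)
    r_bar = mean(resid)
    denom = sum((r - r_bar) ** 2 for r in resid)
    num = sum((resid[t] - r_bar) * (resid[t - lag] - r_bar)
              for t in range(lag, n))
    return num / denom


def significant_lags(resid, max_lag):
    """Lags whose autocorrelation falls outside the approximate
    95% band of +/- 2 / sqrt(n)."""
    band = 2 / sqrt(len(resid))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorr(resid, k)) > band]


# Made-up residuals: four positive errors, then four negative, repeated
resid = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0] * 3

flagged = significant_lags(resid, max_lag=4)
print("lags outside the band:", flagged)
```

For this invented series the lag-1 autocorrelation is clearly positive, because errors of the same sign arrive in runs. That is the kind of pattern that adding a lagged term or a seasonal adjustment is meant to remove.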
- Violation of homoscedasticity
Diagnosis – Inspect the plot of residuals vs predicted values and, for time series data, the plot of residuals vs time. Because of imprecision in the coefficient estimates, the errors may tend to be slightly larger for forecasts associated with extreme predictions or extreme values of the independent variables. Therefore, also plot the residuals against each independent variable and check that their spread is consistent.
Solutions – If the dependent variable is strictly positive and the residuals vs predicted plot shows that the size of the errors is proportional to the size of the predictions, apply a log transformation to the dependent variable (the same transformation used to address non-linearity). If a log transformation has already been applied and the data are seasonal, use additive rather than multiplicative seasonal adjustment.
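One crude numerical stand-in for eyeballing the residuals vs predicted plot is to compare the average absolute residual in the upper half of the predictions with that in the lower half. The sketch below does this on synthetic data with multiplicative errors. The `spread_ratio` helper and the data-generating formula are assumptions made for illustration, not a standard test.

```python
from math import exp, log
from statistics import mean


def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def spread_ratio(x, y):
    """Mean |residual| for the larger half of the predictions divided by
    the mean |residual| for the smaller half (about 1 if spread is constant)."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, abs(yi - (a + b * xi)))
                   for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = mean([r for _, r in pairs[:half]])
    high = mean([r for _, r in pairs[half:]])
    return high / low


# Synthetic data: exponential growth with errors of +/- 0.2 on the log scale,
# so on the raw scale the errors grow with the level of y
x = list(range(1, 11))
y = [exp(0.3 * xi + (0.2 if xi % 2 else -0.2)) for xi in x]

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, [log(yi) for yi in y])
print(f"raw scale: {ratio_raw:.2f}, log scale: {ratio_log:.2f}")
```

A ratio well above 1 on the raw scale and close to 1 after the log transformation is consistent with errors that are proportional to the size of the predictions. Note that the raw fit here is also curved, so part of the raw-scale ratio reflects non-linearity rather than unequal variance alone.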
- Violation of normality
Diagnosis – The best way to check for normally distributed errors is a normal probability plot. This plots the fractiles of the error distribution against the fractiles of a normal distribution. If the distribution is normal, the points on the plot fall close to the diagonal reference line. An S-shaped pattern of deviations indicates that there are either too many or too few large errors in both directions. A bow-shaped pattern of deviations, on the other hand, indicates that the residuals have too many large errors in one direction.
Solutions – The usual fix is a nonlinear transformation of the variables, such as the log transformation. This fix is needed only when the errors are in fact not normally distributed.
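The points of a normal probability plot can be computed with the Python standard library, as in the sketch below. The plotting positions (i + 0.5) / n are one common convention among several, and `straightness` is a hypothetical summary that reports how closely the points follow a straight line. Both example residual sets are synthetic.

```python
from math import log, sqrt
from statistics import NormalDist, mean


def normal_plot_points(resid):
    """Pairs of (normal fractile, ordered residual) for a probability plot."""
    n = len(resid)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(resid)))


def straightness(points):
    """Correlation between the two coordinates; near 1 means the points
    lie close to a straight reference line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in points)
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


n = 20
probs = [(i + 0.5) / n for i in range(n)]
# Synthetic residuals: one set shaped like a normal sample, one right-skewed
normal_like = [3.0 * NormalDist().inv_cdf(p) for p in probs]
skewed = [-log(1.0 - p) for p in probs]

s_normal = straightness(normal_plot_points(normal_like))
s_skewed = straightness(normal_plot_points(skewed))
print(f"normal-like: {s_normal:.3f}, skewed: {s_skewed:.3f}")
```

The skewed set bends away from the reference line in one direction, which corresponds to the bow-shaped pattern described above, and its straightness value is noticeably lower. In practice the pairs would be drawn as a scatter plot rather than reduced to a single number.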
- Violation of the no-multicollinearity assumption
Diagnosis – To check for correlation among the predictors, use scatter plots of pairs of predictors. Alternatively, use the variance inflation factor (VIF). A VIF of 10 or more indicates serious multicollinearity, whereas a VIF of 4 or less suggests there is none.
Solution – The simplest way to eliminate multicollinearity is to remove one of the two correlated predictors that show a high VIF from the model. You can use stepwise regression or best subsets regression to decide which predictors to drop. Otherwise, use partial least squares (PLS) regression to reduce the predictors to a smaller set of uncorrelated components.
By applying the fixes described above, you can correct these violations, refine the analysis and explore the true potential of the linear regression model.
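For the special case of exactly two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below uses that shortcut together with the cut-offs of 10 and 4 quoted above. The function names and predictor values are made up, and with more than two predictors each VIF would instead come from regressing one predictor on all the others.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


def describe(vif):
    """Apply the cut-offs quoted in the text above."""
    if vif >= 10:
        return "serious multicollinearity"
    if vif <= 4:
        return "no multicollinearity"
    return "between the two cut-offs"


# Made-up predictors: x2 is roughly twice x1, x3 is unrelated to x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
x3 = [5, 1, 4, 8, 2, 7, 3, 6]

vif_12 = vif_two_predictors(x1, x2)
vif_13 = vif_two_predictors(x1, x3)
print(describe(vif_12), "|", describe(vif_13))
```

Here x1 and x2 carry almost the same information, so keeping both adds little to the model, while x1 and x3 can safely appear together.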