So far we have studied the relationship between one dependent variable and one independent variable. What about the scenario with two or more variables?
Multiple Linear Regression is used to estimate the relationship between two or more independent variables and one dependent variable. It is a statistical technique that predicts the outcome of a variable based on the values of two or more other variables, and it is also known simply as multiple regression.
Think of it this way: your height depends on the nutrition you take in. But is that the only factor? The height of your parents, your physical fitness, and your environment also play important roles. Thus, your height depends on more than one factor/variable.
Note: We will reserve the term multiple regression for models with two or more predictors (independent variables) and one response (dependent variable). Regression models with two or more response variables also exist, but they are not covered here.
In this chapter, we will learn a new method for computing the parameter estimates of multiple regression models. This method is compact and remains convenient even when the number of unknown parameters is large. So let's begin!
Relation with Linear Regression:
Remember that simple linear regression models a one-to-one relationship: we used only one independent variable to explain the variation in the dependent variable. In multiple regression, we have a many-to-one relationship: two or more independent variables are used to predict the variation in the value of a dependent variable.
Thus, Multiple Regression can also be referred to as an extension of linear regression.
The addition of more independent variables creates more relationships among them. Not only may the independent variables be related to the dependent variable, they may also be related to each other. When this happens, it is called multicollinearity. For instance, suppose you put both rock salt and table salt in your dinner; all you know is that your meal tastes salty. Can you tell the two salts apart? No! Both salts now have the same relationship with your dinner. There is no distinction left between the independent variables, which makes it a problem to estimate which salt is more responsible for the saltiness in your meal.
Hence, ideally all independent variables should be related to the dependent variable, but not to each other.
Multiple Regression Equation
Earlier we derived the equation of a line for the Simple Linear Regression model. Here, the equation changes with the number of variables. Let's have a look.
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βᵢxᵢ

On the right-hand side of the equation, we have a sum of linear terms: β₀ (beta sub-zero), which is our intercept; then β₁x₁, the first variable with its coefficient; β₂x₂, the second variable with its coefficient; and so on up to i, the number of parameters considered.
Note: We are not considering the error term as of now.
So we follow the same basic form: the intercept plus the coefficients, each paired with its variable, gives the estimate for our multiple regression model.
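This form can be sketched with NumPy's least-squares solver. Note that the small weight/height/BMI sample below is made up for illustration; only the first row (28 kg, 121.92 cm, BMI 18.8) comes from the table in this chapter.

```python
import numpy as np

# Hypothetical sample (weight in kg, height in cm, BMI); only the first
# row matches the chapter's table, the rest are illustrative values
weight = np.array([28.0, 35.0, 42.0, 50.0, 61.0, 70.0])
height = np.array([121.92, 132.0, 144.0, 152.0, 163.0, 170.0])
bmi    = np.array([18.8, 20.1, 20.3, 21.6, 23.0, 24.2])

# Design matrix: a column of ones for the intercept beta_0,
# then one column per independent variable (x1 = weight, x2 = height)
X = np.column_stack([np.ones_like(weight), weight, height])

# Least-squares estimates of [beta_0, beta_1, beta_2]
beta, *_ = np.linalg.lstsq(X, bmi, rcond=None)

# Plug a new individual (45 kg, 150 cm) into y = b0 + b1*x1 + b2*x2
y_hat = beta @ [1.0, 45.0, 150.0]
print(beta, y_hat)
```

The same design-matrix-plus-least-squares pattern scales to any number of predictors, which is exactly the compactness this chapter's method aims for.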
Let’s start with an example for conducting our analysis of multiple regression. Consider a random sample where the height and weight of an individual are independent variables and BMI (Body Mass Index) acts as the dependent variable.
| Weight (in kgs) | Height (in cms) | BMI |
| --- | --- | --- |
| 28 | 121.92 | 18.8 |
| … | … | … |
From the table, we know that an individual who weighs 28 kg and has a height of 121.92 cm has a BMI of 18.8. BMI acts as the dependent variable, i.e., y. Weight and Height are the independent variables, denoted x1 and x2 respectively.
According to what we discussed earlier, there are 2 relationships to analyze between the independent variables and the dependent variable, and 1 relationship between the independent variables themselves. Thus, in total, we have 3 relationships to analyze. To check for multicollinearity in these 3 relationships:
1. Look at the scatterplots of each independent variable against the dependent variable.
Recall forming the regression equation for the above graph with the given data points. On doing that, you may find the following equation for Weight (x1) and BMI (y):

ŷ = β₀ + 0.1109·x₁
This means an increase of 1 kg in weight increases the predicted BMI by 0.1109.
Similarly, form a regression equation for the above graph with the given data points. On doing that, you may find the following equation for Height (x2) and BMI (y):

ŷ = β₀ + 0.1077·x₂
This means an increase of 1 cm in height increases the predicted BMI by 0.1077.
Summary: Our first independent variable, the weight of the individual, has a strong linear relationship with BMI. Our second independent variable, the height of the individual, also has a strong linear relationship with BMI, as shown in the graph. Thus, BMI appears highly correlated with both the weight and height variables.
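These two simple regressions can be sketched as follows. The data are the same illustrative sample as before, not the chapter's original dataset, so the fitted slopes will differ from 0.1109 and 0.1077; the point is the procedure of fitting one simple regression per predictor and checking its correlation with y.

```python
import numpy as np

# Illustrative sample (not the chapter's original data)
weight = np.array([28.0, 35.0, 42.0, 50.0, 61.0, 70.0])
height = np.array([121.92, 132.0, 144.0, 152.0, 163.0, 170.0])
bmi    = np.array([18.8, 20.1, 20.3, 21.6, 23.0, 24.2])

# One simple linear regression per independent variable
slope_w, intercept_w = np.polyfit(weight, bmi, 1)   # BMI ~ weight
slope_h, intercept_h = np.polyfit(height, bmi, 1)   # BMI ~ height

# Pearson correlation of each predictor with the dependent variable
r_w = np.corrcoef(weight, bmi)[0, 1]
r_h = np.corrcoef(height, bmi)[0, 1]
print(slope_w, r_w, slope_h, r_h)
```

Correlations close to 1 (or -1) confirm what the scatterplots show visually: each predictor has a strong linear relationship with BMI.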
2. Look at the scatterplots of the independent variables against each other.
The above graph shows a straight line going from the bottom left to the top right, which tells us we have a problem: weight and height appear highly correlated, i.e., they have a strongly linear relationship. This means the model does not know what coefficients to assign to these two variables when they look so similar. Our model seems fine otherwise, but this factor warns us that we might run into problems. We will see that in further sections.
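The same check can be done numerically: compute the correlation between the two independent variables (again using the illustrative sample from above, not the chapter's original data).

```python
import numpy as np

# Illustrative sample of the two independent variables
weight = np.array([28.0, 35.0, 42.0, 50.0, 61.0, 70.0])
height = np.array([121.92, 132.0, 144.0, 152.0, 163.0, 170.0])

# Correlation between x1 and x2; a value close to 1 (or -1) signals
# multicollinearity between the predictors
r_x1_x2 = np.corrcoef(weight, height)[0, 1]
print(r_x1_x2)
```

A predictor-predictor correlation this close to 1 is exactly the rock-salt/table-salt situation: the model cannot separate the two variables' contributions.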
By now we have done some visual examination of the scatterplots. Let's summarise!
- Generate a list of potential variables, both independent and dependent.
- Collect data according to the variables.
- Check the relationship between each independent and dependent variable using scatterplots and correlations.
- Check for multicollinearity among the independent variables.
- Conduct simple linear regressions for each pair.
- Use the non-redundant independent variables in the analysis to find the best-fitting model.
- Use the best-fitting model for predicting the value of the dependent variable.
VIF (Variance Inflation Factor): It points out variables that are collinear, that is, it detects multicollinearity. It measures how much the variance of an estimated regression coefficient increases when your predictors are correlated. Ranges for VIF:
If VIF is 1, the predictor is not correlated with the other predictors. If VIF is between 1 and 5, the predictors are moderately correlated. If VIF is between 5 and 10, it indicates a high correlation that may be problematic in some cases. If VIF goes above 10, you can conclude that serious multicollinearity is present in the model.
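A minimal sketch of computing VIF from its definition: for each predictor j, regress it on the remaining predictors and take 1 / (1 - R²ⱼ). The data are the illustrative weight/height sample used earlier.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress column j on the remaining
    columns (plus an intercept) and return 1 / (1 - R_squared_j)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()   # R^2 of this auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Illustrative predictors (not the chapter's original data)
weight = np.array([28.0, 35.0, 42.0, 50.0, 61.0, 70.0])
height = np.array([121.92, 132.0, 144.0, 152.0, 163.0, 170.0])
X = np.column_stack([weight, height])
print(vif(X))  # both values land well above 10 here -> serious multicollinearity
```

With only two predictors, both VIFs are equal, since each auxiliary regression uses the same pair of variables.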
A dummy variable is an indicator variable used to represent categorical data. Although it stands in for a qualitative attribute, the dummy itself is quantitative: its range of values is restricted, so it can take on only two values. Practically, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0, where 1 represents the presence of the qualitative attribute and 0 represents its absence.
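A dummy variable enters the regression like any other predictor. The sketch below invents a hypothetical 0/1 attribute (whether the individual exercises regularly) alongside the illustrative weight/BMI sample used earlier.

```python
import numpy as np

# Hypothetical dummy: 1 if the individual exercises regularly, 0 otherwise
exercises = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
weight    = np.array([28.0, 35.0, 42.0, 50.0, 61.0, 70.0])
bmi       = np.array([18.8, 20.1, 20.3, 21.6, 23.0, 24.2])

# The dummy is just another column in the design matrix:
# BMI = b0 + b1*weight + b2*exercises
X = np.column_stack([np.ones_like(weight), weight, exercises])
beta, *_ = np.linalg.lstsq(X, bmi, rcond=None)
print(beta)  # beta[2] is the BMI shift associated with the attribute being present
```

Interpreting beta[2] is straightforward with 0/1 coding: it is the estimated difference in BMI between individuals with and without the attribute, holding weight fixed.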