
It is important to understand that *multiple linear regression analysis* is a *procedure* that may or may not produce a multiple linear regression *model*.

Once you understand the difference between multiple linear regression analysis (a procedure) and a multiple linear regression model (the result of that procedure when it yields 2 or more statistically significant independent variables), we can look at the model building procedures.

Four common multiple linear regression analysis procedures for *attempting* to produce a multiple linear regression model are: 1) standard multiple linear regression analysis, 2) stepwise multiple linear regression analysis, 3) hierarchical multiple linear regression analysis, and 4) a combination of hierarchical and stepwise multiple linear regression analysis.

In standard multiple linear regression analysis, all independent variables are entered into the model simultaneously, regardless of whether or not each was individually statistically significantly associated with the dependent variable in bivariate analyses. Although this is a common procedure, it has some limitations.

One limitation is the sample size requirement. A general rule of thumb is that you should have between 10 and 20 observations per independent variable in the model. So, for example, if you have 6 continuous independent variables, you will need a sample size between 60 and 120. If some of your independent variables are categorical with more than 2 categories, each such variable is dummy coded, and every category beyond the reference category adds another independent variable to the model. You would therefore need an even larger sample size, because the model would then have more than 6 independent variables in this example.
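The rule of thumb above can be sketched as a small calculation. This is a minimal illustration of the counting described in the text; the function names are hypothetical, and the 10-to-20 range is the rule of thumb stated above, not a formal power analysis.

```python
# Rule-of-thumb sample-size range for a planned regression model.
# Assumes 10-20 observations per model term, as described in the text.

def model_terms(n_continuous, category_counts=()):
    """Count model terms: each continuous IV is one term; a categorical
    IV with k categories contributes k - 1 dummy-coded terms."""
    return n_continuous + sum(k - 1 for k in category_counts)

def sample_size_range(n_continuous, category_counts=()):
    """Return the (10x, 20x) rule-of-thumb sample-size range."""
    terms = model_terms(n_continuous, category_counts)
    return 10 * terms, 20 * terms

# 6 continuous IVs -> 60 to 120 observations, matching the example above.
print(sample_size_range(6))        # (60, 120)

# 6 continuous IVs plus one 4-category IV (3 dummies) -> 9 model terms.
print(sample_size_range(6, (4,)))  # (90, 180)
```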

A second limitation is that when you enter one or more independent variables into a model, the regression coefficient for each independent variable is adjusted for the other independent variables in the model. So if you enter nonsensical independent variables into a model, they can adversely affect the regression coefficients and associated p-values of the relevant independent variables, producing an invalid model.
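A small numpy sketch can make this adjustment concrete. The data below are synthetic and the variable names hypothetical: y truly depends only on x1, yet entering a correlated nuisance variable x2 shifts the estimated coefficient for x1 and inflates its standard error.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares via the normal equations; X must already
    contain an intercept column. Returns coefficients and standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # nuisance IV, correlated with x1
y = 2.0 * x1 + rng.normal(size=n)              # y truly depends on x1 only

ones = np.ones(n)
b1, se1 = ols(np.column_stack([ones, x1]), y)      # model: y ~ x1
b2, se2 = ols(np.column_stack([ones, x1, x2]), y)  # model: y ~ x1 + x2

# The estimate for x1 changes, and its standard error grows, once the
# correlated nuisance variable is in the model.
print(b1[1], se1[1])
print(b2[1], se2[1])
```

The inflated standard error is what can push a genuinely relevant variable's p-value above the significance threshold, which is the problem described above.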

In stepwise multiple linear regression analysis, a computer algorithm determines which variables to enter into the model and in which order. In the first step, the algorithm searches through all of the independent variables to determine which one(s), if any, are *individually* statistically significantly associated with the dependent variable in bivariate analyses.

In the second step, if none of the independent variables is found to be individually statistically significantly associated with the dependent variable, the algorithm stops and there is no model to report. If one or more independent variables are found to be individually statistically significant, the one with the smallest p-value is entered into the model.

In the third step, the algorithm searches through all of the remaining independent variables and determines whether one or more of them explains a statistically significant amount of additional variance in the dependent variable, above and beyond the variance explained by the independent variable already in the model. If none of the remaining independent variables is statistically significant, the algorithm stops, and because only one independent variable was statistically significant, you do not have a multiple linear regression model.

If one or more of the remaining independent variables explains a statistically significant amount of additional variance in the dependent variable, the variable with the smallest p-value is *added* to the model. If the addition of the second independent variable causes the first to become statistically non-significant, further diagnostics are required to determine why that might have happened (e.g., a violation of assumptions such as multicollinearity).

The algorithm continues in this fashion until none of the remaining independent variables explain a statistically significant amount of additional variance in the dependent variable. If the final model has 2 or more statistically significant independent variables and all of the assumptions have been satisfied, then you have a valid multiple linear regression model.
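The forward-selection loop described above can be sketched as follows. This is a simplified illustration on synthetic data, not the exact algorithm in any particular package: a fixed F-to-enter threshold (roughly alpha = 0.05 for moderate samples, an assumption here) stands in for the p-value criterion that statistical software such as SPSS actually uses.

```python
import numpy as np

def rss(cols, y):
    """Residual sum of squares for OLS of y on an intercept plus the
    given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_stepwise(X, y, f_in=4.0):
    """Greedily enter the variable with the largest partial F statistic,
    stopping when no remaining variable exceeds the F-to-enter threshold."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        rss_cur = rss([X[:, j] for j in selected], y)
        best_j, best_f = None, f_in
        for j in remaining:
            rss_new = rss([X[:, k] for k in selected + [j]], y)
            df_resid = n - (len(selected) + 2)  # intercept + slopes so far + candidate
            f = (rss_cur - rss_new) / (rss_new / df_resid)
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:
            break  # no remaining variable adds significant variance
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 4))
# y depends on columns 0 and 2 only; columns 1 and 3 are pure noise.
y = 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(size=n)

selected = forward_stepwise(X, y)
print(selected)  # columns 0 and 2 should enter first, strongest first
```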

In hierarchical multiple linear regression analysis, independent variables are entered into the model in a specific sequence (i.e., a hierarchy of importance). Which variable(s) should be entered first is sometimes informed by theory and/or empirical evidence. To use a hypothetical example, if theory suggests gender must absolutely be accounted for before any other variables can be evaluated, one might choose to enter gender into the model first, regardless of whether or not gender was statistically significant in one's particular dataset.

Even in the absence of theory, if many studies reported in the literature show statistically significant evidence that gender (for example) is the strongest predictor of some dependent variable, one might choose to enter gender into the model whether or not it was statistically significant in their particular dataset.

In the absence of theory or substantial empirical evidence in the literature, one might instead enter independent variables into the model hierarchically according to the results of bivariate analyses of one's own dataset. In this case, the criterion for the order of entry (i.e., the hierarchy) is the statistical significance of each independent variable's bivariate association with the dependent variable. This is similar to the stepwise model building procedure, but the researcher has more control over the process.

For example, a researcher may find from bivariate analyses of each of 6 independent variables (a hypothetical example) versus the dependent variable that 2 of the independent variables were statistically significant and the other 4 were not. The researcher might logically conclude that it makes sense to first enter the independent variable most strongly associated with the dependent variable (i.e., the one with the smallest p-value). Once that variable has been entered into the model, the researcher may attempt to enter the second of the 2 statistically significant independent variables.
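Hierarchical entry is often reported as the R-squared after each step, together with the increment each new variable contributes. The sketch below, on synthetic data with hypothetical names, shows a researcher-chosen entry order and the R-squared change at each step; significance testing of each increment is omitted for brevity.

```python
import numpy as np

def r_squared(cols, y):
    """R-squared for OLS of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def hierarchical_r2(X, y, order):
    """R-squared after each step of entering columns of X in a fixed,
    researcher-chosen order (the hierarchy)."""
    steps, cols = [], []
    for j in order:
        cols.append(X[:, j])
        steps.append(r_squared(cols, y))
    return steps

rng = np.random.default_rng(2)
n = 120
X = rng.normal(size=(n, 3))
# y depends on columns 0 and 1; column 2 is noise.
y = 1.2 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(size=n)

# Enter column 0 first (strongest bivariate association in this design),
# then column 1, then column 2.
r2 = hierarchical_r2(X, y, order=[0, 1, 2])
delta = [r2[0]] + [b - a for a, b in zip(r2, r2[1:])]
print([round(v, 3) for v in r2])     # cumulative R-squared per step
print([round(v, 3) for v in delta])  # increment contributed at each step
```

Note that R-squared can only stay the same or grow as variables are added; the question at each step is whether the increment is statistically significant.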

In the combined procedure, one or more independent variables are first entered into the model as explained above for hierarchical multiple linear regression analysis, based on theory, substantial empirical evidence in the literature, or results from analyses of the researcher's particular dataset. Once those variables are in the model, there may be additional independent variables that could explain additional variance in the dependent variable above and beyond the variance already explained. Because the variables already in the model account for some of the variance in the dependent variable, there is less residual variation left for additional independent variables to explain, which can make the tests of those additional variables statistically more powerful.

At that point, there may be no a priori rationale for which independent variable(s) to attempt to enter next, and that is where the stepwise model selection procedure can be helpful. Having entered the independent variables they feel are most important during the hierarchical model building phase, the researcher can leave it to the computer algorithm, as discussed above for the stepwise procedure, to decide which, if any, of the remaining independent variables explain a statistically significant amount of additional variance in the dependent variable above and beyond the variance explained by the variables already in the model.
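The combined approach can be sketched as forward selection with a forced-in set. As before this is a simplified illustration on synthetic data: column 0 plays the role of the theory-driven variable entered regardless of significance, and a fixed F-to-enter threshold (an assumption) stands in for the p-value criterion used by real software.

```python
import numpy as np

def rss(cols, y):
    """Residual sum of squares for OLS of y on an intercept plus columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def hierarchical_then_stepwise(X, y, forced, f_in=4.0):
    """Force the hierarchical variables in first, then run a forward
    stepwise pass over the remaining columns."""
    n, p = X.shape
    selected = list(forced)  # entered regardless of statistical significance
    remaining = [j for j in range(p) if j not in selected]
    while remaining:
        rss_cur = rss([X[:, j] for j in selected], y)
        best_j, best_f = None, f_in
        for j in remaining:
            rss_new = rss([X[:, k] for k in selected + [j]], y)
            f = (rss_cur - rss_new) / (rss_new / (n - len(selected) - 2))
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:
            break  # no remaining variable adds significant variance
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(3)
n = 150
X = rng.normal(size=(n, 4))
# y depends on columns 1 and 2; column 0 is the theory-driven forced variable.
y = 0.9 * X[:, 1] + 0.9 * X[:, 2] + rng.normal(size=n)

selected = hierarchical_then_stepwise(X, y, forced=[0])
print(selected)  # column 0 first (forced), then the significant predictors
```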

When you hire me to do the statistical analysis for your dissertation, I carefully determine the appropriate statistical methods for your study. I can perform virtually any standard statistical analysis (using SPSS) and I provide ongoing statistical help to make sure that you fully understand the statistics used in your research, so you can go into your dissertation defense with confidence.

Simply contact me by phone or email to get started.

Steve Creech

630-936-4771 | Steve@StatisticallySignificantConsulting.com