After completion of this topic, you should be able to:
Explain the concept of multiple regression
Justify the need to take into consideration the assumptions of multiple regression
Understand how multiple regression is used in making predictions
Compute and justify the use of multiple regression
Interpret the SPSS output for multiple regression
Selection of the DV and the set of IVs
You should take extra care from the outset in selecting the DV and the IVs. Since multiple regression is a "dependence technique", you must specify which variable is the DV and which are the IVs [this sounds like common sense, but beginning researchers do get confused!]. As mentioned earlier, the selection of both the DV and the IVs should be based on a sound theoretical or conceptual framework, and it will be guided by the research questions or research problem.
The Dependent Variable (DV) should be a metric or continuous variable, which means it should take score values such as 1, 20, 30 or 99 (for example, scores in a mathematics test or GPA). If your Dependent Variable is categorical, such as 1 = low, 2 = average and 3 = high, then a different method called Logistic Regression should be used [it is not discussed here].
The size of the sample is an important consideration when using multiple regression, and you should give it serious thought before you begin your study. The sample size affects both whether a relationship between the DV and the set of IVs can be detected as statistically significant and the generalisability of the findings.
Graduate students often ask the question: How many subjects are enough? The safest way to answer this question is to refer to the guidelines provided by J. Cohen and P. Cohen (see Table below).
First you have to understand the meaning of "power" in multiple regression. Power is the probability or likelihood of obtaining a statistically significant R-square (R²) for a specified sample size. Cohen and Cohen (1983) used a power of .80 and calculated the minimum R² that the specified sample size will detect as statistically significant at an alpha of .05. The Table below gives the results and an indication of what your sample size should be.
How do you interpret the Table? Say, for example, that in your study you have two IVs (i.e. motivation and self-esteem) and your DV is academic achievement. According to the Table, at a significance level of .05, if you select 20 subjects you will detect R² values of 39% and above. What does this mean? With only 20 subjects, an R² lower than 39% is unlikely to be detected as statistically significant.
You will notice that if you increase your sample size to 100, the minimum R² value is 10%, and if you increase it to 250 the minimum R² value is 4%. So the larger the sample size, the greater your likelihood of obtaining a significant relationship (at .05) between the DV and the two IVs.
If you have five IVs in your study and your sample size is only 20, the R² value has to be 48% or above to be detected as significant (at .05). However, if you increase the sample size to 250, the minimum R² value is 5%, which means that an R² of 5% or above will be detected as significant.
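The way the Table is read in the examples above can be sketched in a few lines of Python. The values are transcribed from the Table in this topic; the function names (min_detectable_r2, is_detectable) are made up for illustration and are not part of SPSS or any library.

```python
# Minimum R-squared (in %) detectable at alpha = .05 with power = .80,
# transcribed from the Table in this topic (Cohen & Cohen, 1983).
# None marks the "NA" cell.
MIN_R2_PERCENT = {
    20:   {2: 39, 5: 48, 10: 64, 20: None},
    50:   {2: 19, 5: 23, 10: 29, 20: 42},
    100:  {2: 10, 5: 12, 10: 15, 20: 21},
    250:  {2: 4,  5: 5,  10: 6,  20: 8},
    500:  {2: 3,  5: 4,  10: 5,  20: 9},
    1000: {2: 1,  5: 1,  10: 2,  20: 2},
}


def min_detectable_r2(sample_size, n_ivs):
    """Return the tabled minimum R-squared (%), or None if the cell is NA."""
    return MIN_R2_PERCENT[sample_size][n_ivs]


def is_detectable(r2_percent, sample_size, n_ivs):
    """True if an R-squared of r2_percent meets the tabled minimum."""
    minimum = min_detectable_r2(sample_size, n_ivs)
    if minimum is None:
        raise ValueError("The Table gives no value for this combination.")
    return r2_percent >= minimum


print(min_detectable_r2(20, 2))    # 39
print(is_detectable(25, 20, 2))    # False
print(is_detectable(25, 100, 2))   # True
```

The lookup only covers the sample sizes and numbers of IVs listed in the Table; it does not interpolate between rows or columns.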
b) Why is multicollinearity important when using multiple regression?
c) How large should your sample be when using multiple regression?
What is Multiple Regression Used For?
Multiple Regression is often used to objectively assess the degree and character of the relationship between the DV and the set of IVs. For example, you may want to know the relationship between Academic Performance (DV) and IQ (IV1), Motivation (IV2) and Self-Esteem (IV3). See diagram below. To repeat, when using Multiple Regression, researchers use the term “Independent Variables (IV)” to identify those variables that they think will influence some other “Dependent Variable (DV)”.
Having more than one Independent Variable (or predictor variable) is useful when predicting human behaviour (which we do in education). Our actions, thoughts and emotions are all likely to be influenced by some combination of several factors. Using multiple regression we can test theories (or models) about precisely which set of variables is influencing our behaviour; in this case the behaviour is 'academic performance'. You should keep in mind that the selection of the set of IVs should be based on a theory that explains the relationship. For example, you certainly would not want to include "head size" as an IV to predict academic performance because there is no theoretical basis for such a relationship; even if there is, it does not make much sense studying it!
What is Multiple Regression?
In Topic 11, we discussed Linear Regression, which is used when you want to predict the value of one variable (y) based on the information you have about another variable (x). For example, using linear regression you are able to predict performance in mathematics based on information you have about attitudes towards mathematics. Here you are predicting the Dependent Variable (Mathematics Scores) based on one IV (Attitudes Towards Mathematics).
Multiple Regression is a multivariate statistical technique used to analyse the relationship between a SINGLE DEPENDENT VARIABLE (DV) and SEVERAL INDEPENDENT VARIABLES (IV). The single DV is also called the Criterion Variable, while the several IVs are called Predictor Variables.
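As a rough illustration of this definition, the sketch below fits one DV on three IVs by ordinary least squares using NumPy. The scores are simulated for illustration only, and the variable names (iq, motivation, self_esteem, performance) and the numbers used to generate them are assumptions, not data from this topic. In this course the same analysis would be run in SPSS.

```python
import numpy as np

# Hypothetical, simulated scores for 50 students (not real data).
rng = np.random.default_rng(42)
n = 50
iq = rng.normal(100, 15, n)
motivation = rng.normal(50, 10, n)
self_esteem = rng.normal(30, 5, n)
# The DV is built from the IVs plus random noise so that a relationship exists.
performance = (10 + 0.3 * iq + 0.5 * motivation + 0.4 * self_esteem
               + rng.normal(0, 5, n))

# Design matrix: a column of ones (intercept) followed by the three IVs.
X = np.column_stack([np.ones(n), iq, motivation, self_esteem])
coefs, *_ = np.linalg.lstsq(X, performance, rcond=None)

# R-squared: proportion of variance in the DV explained by the set of IVs.
predicted = X @ coefs
ss_res = np.sum((performance - predicted) ** 2)
ss_tot = np.sum((performance - performance.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept and slopes:", np.round(coefs, 2))
print("R-squared:", round(float(r_squared), 3))
```

The first coefficient is the intercept and the remaining three are the slopes for the IVs, in the order in which the columns were stacked.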
What Does Multiple Regression Tell You?
Multiple Regression tells you the following characteristics about the relationship between the DV and the set of IVs:
Direction of the Relationship - Whether the relationship between the DV and the set of IVs is positive (+) or negative (-).
Magnitude of the Relationship - The strength of the relationship between the DV and the set of IVs, together with an indication of the relative importance of each predictor or IV.
Nature of the Relationship - Whether the relationship between the DV and the set of IVs is linear or of another type, such as curvilinear.
Indicate Redundancy - Some IVs may be redundant in their predictive effort and not needed to produce the optimal prediction. Individually, a particular IV may be correlated with the DV, but together with other IVs (i.e. in a multivariate context) the variable may not be needed because other variables are already explaining the variance.
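The idea of redundancy can be sketched with simulated data: an IV that predicts the DV well on its own adds almost nothing once a closely related IV is already in the model. The data and the variable "effort" below are hypothetical and chosen only to make the point visible; the helper r_squared is written for this example and is not a library function.

```python
import numpy as np


def r_squared(y, *predictors):
    """R-squared from an ordinary least squares fit of y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)


# Hypothetical simulated data (not from any real study).
rng = np.random.default_rng(7)
n = 200
motivation = rng.normal(50, 10, n)
# "effort" is constructed to be almost a copy of motivation.
effort = motivation + rng.normal(0, 2, n)
performance = 20 + 0.8 * motivation + rng.normal(0, 6, n)

r2_effort_alone = r_squared(performance, effort)
r2_motivation = r_squared(performance, motivation)
r2_both = r_squared(performance, motivation, effort)

print("effort alone:       ", round(float(r2_effort_alone), 3))
print("motivation alone:   ", round(float(r2_motivation), 3))
print("motivation + effort:", round(float(r2_both), 3))
```

In a run like this, effort alone explains a sizeable share of the variance, yet adding it to a model that already contains motivation raises R² only marginally, which is what redundancy means in a multivariate context.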
Significance Level alpha = 0.05
(cell values are the minimum R², in %, detectable with a power of .80)

                 Number of Independent Variables
Sample Size        2       5      10      20
     20           39      48      64      NA
     50           19      23      29      42
    100           10      12      15      21
    250            4       5       6       8
    500            3       4       5       9
   1000            1       1       2       2
To what extent are the Independent Variables and the Dependent Variable correlated? Generally, you want the Dependent Variable to be correlated with each Independent Variable. For example, you surely do not want to correlate head size with academic performance!
On the other hand, each Independent Variable (e.g. self-esteem) should not be strongly correlated with the other Independent Variables (such as motivation and IQ). However, when dealing with human behaviour it is common for the Independent Variables to be correlated.
What do you think will happen when there is a high correlation between the different Independent Variables? Such high correlations, known as multicollinearity, cause problems when trying to draw inferences about the relative contribution of each independent variable to the dependent variable. Is it attitude or motivation that contributed to academic performance? [Fortunately, SPSS provides a method for checking for multicollinearity.]
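One common way to check for multicollinearity is the variance inflation factor (VIF): each IV is regressed on the remaining IVs, and VIF = 1 / (1 - R²) for that regression. The text above does not say which check is meant, so treating VIF as the method is an assumption. The sketch below computes it with NumPy on simulated IVs, in which attitude is deliberately built to overlap with motivation; the data and the function name vif are hypothetical.

```python
import numpy as np


def vif(predictors):
    """Variance inflation factor for each column of a 2-D array of IVs."""
    predictors = np.asarray(predictors, dtype=float)
    n, k = predictors.shape
    factors = []
    for j in range(k):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        # Regress IV j on all the other IVs (with an intercept).
        X = np.column_stack([np.ones(n), others])
        coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
        residuals = target - X @ coefs
        r2 = 1 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)


# Hypothetical simulated IVs (not real data).
rng = np.random.default_rng(1)
n = 200
iq = rng.normal(100, 15, n)
motivation = rng.normal(50, 10, n)
# attitude is built to overlap heavily with motivation.
attitude = 0.9 * motivation + rng.normal(0, 3, n)

vifs = vif(np.column_stack([iq, motivation, attitude]))
print("VIF for iq, motivation, attitude:", np.round(vifs, 2))
```

A VIF close to 1 means the IV shares little variance with the other IVs; the larger the value, the more that IV overlaps with the others and the harder it is to separate its contribution.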
Table Showing Sample Size Required for Use of Multiple Regression
[source: J. Cohen & P. Cohen (1983). Applied Multiple Regression/Correlation Analysis for the Behavioural Sciences. Hillsdale, NJ: Lawrence Erlbaum]