Chapter 2 Review of Basic Statistics
2.1 Calculating Variance and Covariance
2.1.1 Covariance
population: \[{{\sigma }_{\text{XY}}}=\frac{\sum{(X-{{\mu }_{\text{X}}})(Y-{{\mu }_{\text{Y}}})}}{{{N}_{pop}}}\] sample: \[{{s}_{XY}}={{\hat{\sigma }}_{XY}}=Cov(X,Y)=\frac{\sum{(X-\bar{X})(Y-\bar{Y})}}{(n-1)}\]
2.1.2 Correlation
population: \[{{\rho }_{\text{XY}}}=\frac{{{\sigma }_{\text{XY}}}}{\sqrt{{{\sigma }_{\text{X}}}^{2}{{\sigma }_{\text{Y}}}^{2}}}\]
sample: \[{{r}_{\text{XY}}}={{\hat{\rho }}_{\text{XY}}}=\frac{Cov(X,Y)}{\sqrt{{{s}_{\text{X}}}^{2}{{s}_{\text{Y}}}^{2}}}\]
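The sample formulas above can be checked with a minimal sketch in plain Python (the data vectors are the two columns of the example matrix in the next subsection; variable names are mine):

```python
import math

# Sample covariance and correlation, using the (n - 1) denominator
# from the formulas above.
X = [1, 4, 1, 3, 7]
Y = [3, -5, 7, 2, -1]

n = len(X)
x_bar = sum(X) / n          # 3.2
y_bar = sum(Y) / n          # 1.2

# s_XY = sum((X - Xbar)(Y - Ybar)) / (n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)

# Sample variances, same denominator.
s_x2 = sum((x - x_bar) ** 2 for x in X) / (n - 1)
s_y2 = sum((y - y_bar) ** 2 for y in Y) / (n - 1)

# r_XY = Cov(X, Y) / sqrt(s_X^2 * s_Y^2)
r_xy = s_xy / math.sqrt(s_x2 * s_y2)

print(round(s_xy, 2))  # -7.55
print(round(r_xy, 2))  # -0.67
```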
2.1.3 Using Matrix Algebra
Suppose we have a 5 by 2 data matrix X (2 variables and 5 participants, for example). \[\begin{align} & \text{ X1 X2} \\ & \mathbf{X}\text{=}\left[ \begin{matrix} 1 & 3 \\ 4 & -5 \\ 1 & 7 \\ 3 & 2 \\ 7 & -1 \\ \end{matrix} \right] \\ \end{align}\]
First, we’ll calculate the deviation score matrix Xd.
\[{{\mathbf{X}}_{d}}\text{=}\mathbf{X-\bar{X}}=\left[ \begin{matrix} 1 & 3 \\ 4 & -5 \\ 1 & 7 \\ 3 & 2 \\ 7 & -1 \\ \end{matrix} \right]-\left[ \begin{matrix} 3.2 & 1.2 \\ 3.2 & 1.2 \\ 3.2 & 1.2 \\ 3.2 & 1.2 \\ 3.2 & 1.2 \\ \end{matrix} \right]=\left[ \begin{matrix} -2.2 & 1.8 \\ 0.8 & -6.2 \\ -2.2 & 5.8 \\ -0.2 & 0.8 \\ 3.8 & -2.2 \\ \end{matrix} \right]\]

Next, we multiply the transpose of \({{\bf{X}}_{\bf{d}}}\) ( \({{\bf{X}}_{\bf{d}}}^{\bf{'}}\) ) by \({{\bf{X}}_{\bf{d}}}\) itself using matrix operations. The result is the deviation SSCP (sums of squares and cross products) matrix.
Deviation SSCP matrix:
\[{{\mathbf{X}}_{\mathbf{d}}}'{{\mathbf{X}}_{\mathbf{d}}}=\sum{{{x}_{i}}{{x}_{j}}=\left[ \begin{matrix}
\sum{x_{1}^{2}} & \sum{{{x}_{1}}{{x}_{2}}} \\
\sum{{{x}_{2}}{{x}_{1}}} & \sum{x_{2}^{2}} \\
\end{matrix} \right]}=\left[ \begin{matrix}
24.8 & -30.2 \\
-30.2 & 80.8 \\
\end{matrix} \right]\]
(Note: Lower case \({{x}_{i}}\)’s represent deviation scores.)
Since the variance of X1 is \(\frac{1}{N-1}\sum{x_{1}^{2}}\) and the covariance between X1 and X2 is \(\frac{1}{N-1}\sum{{{x}_{1}}{{x}_{2}}}\), the variance and covariance matrix (usually denoted by S) can be obtained by multiplying the deviation SSCP matrix by \(\frac{1}{N-1}\):
Variance and Covariance matrix: \[\mathbf{S}=\frac{1}{5-1}\left[ \begin{matrix} 24.8 & -30.2 \\ -30.2 & 80.8 \\ \end{matrix} \right]=\left[ \begin{matrix} 6.2 & -7.55 \\ -7.55 & 20.2 \\ \end{matrix} \right]\]
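The whole chain above (deviation scores, then the SSCP matrix, then S) can be reproduced in a short plain-Python sketch (loops stand in for matrix operations; variable names are mine):

```python
# Deviation scores, the deviation SSCP matrix Xd'Xd, and S = SSCP / (n - 1)
# for the 5 x 2 example data matrix.
X = [[1, 3], [4, -5], [1, 7], [3, 2], [7, -1]]
n = len(X)

# Column means (3.2 and 1.2).
means = [sum(row[j] for row in X) / n for j in range(2)]

# Deviation score matrix Xd = X - Xbar.
Xd = [[row[j] - means[j] for j in range(2)] for row in X]

# SSCP = Xd'Xd: entry (i, j) is the sum over cases of xd_i * xd_j.
sscp = [[sum(row[i] * row[j] for row in Xd) for j in range(2)]
        for i in range(2)]

# Variance-covariance matrix S = SSCP / (n - 1).
S = [[sscp[i][j] / (n - 1) for j in range(2)] for i in range(2)]

print([[round(v, 2) for v in row] for row in S])
# [[6.2, -7.55], [-7.55, 20.2]]
```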
In general, a variance-covariance matrix has variances on the diagonal and covariances off the diagonal:
\[\begin{array}{l} \begin{array}{*{20}{c}} {}&{{{\rm{X}}_{\rm{1}}}}&{{{\rm{X}}_{\rm{2}}}}&{{{\rm{X}}_{\rm{3}}}} \end{array}\\ \begin{array}{*{20}{c}} {{X_1}}\\ {{X_2}}\\ {{X_3}} \end{array}\left[ {\begin{array}{*{20}{c}} {Var({X_1})}&{Cov({X_1},{X_2})}&{Cov({X_1},{X_3})}\\ {Cov({X_2},{X_1})}&{Var({X_2})}&{Cov({X_2},{X_3})}\\ {Cov({X_3},{X_1})}&{Cov({X_3},{X_2})}&{Var({X_3})} \end{array}} \right] \end{array}\]
Sample Covariance Matrix S
\[\begin{array}{l} \begin{array}{*{20}{c}} {}&{{{\rm{X}}_{\rm{1}}}}&{{{\rm{X}}_{\rm{2}}}}&{{{\rm{X}}_{\rm{3}}}} \end{array}\\ \begin{array}{*{20}{c}} {{X_1}}\\ {{X_2}}\\ {{X_3}} \end{array}\left[ {\begin{array}{*{20}{c}} {s_1^2}&{{s_{12}}}&{{s_{13}}}\\ {{s_{21}}}&{s_2^2}&{{s_{23}}}\\ {{s_{31}}}&{{s_{32}}}&{s_3^2} \end{array}} \right] \end{array}\]
Sample Correlation R \[\begin{array}{l} \begin{array}{*{20}{c}} {}&{{{\rm{X}}_{\rm{1}}}}&{{{\rm{X}}_{\rm{2}}}}&{{{\rm{X}}_{\rm{3}}}} \end{array}\\ \begin{array}{*{20}{c}} {{X_1}}\\ {{X_2}}\\ {{X_3}} \end{array}\left[ {\begin{array}{*{20}{c}} 1&{{r_{12}}}&{{r_{13}}}\\ {{r_{21}}}&1&{{r_{23}}}\\ {{r_{31}}}&{{r_{32}}}&1 \end{array}} \right] \end{array}\]
Elements in sample covariance matrix divided by \({s_i}{s_j}\) would result in the sample correlation matrix.
Elements in sample correlation matrix multiplied by \({s_i}{s_j}\) would result in the sample covariance matrix.
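Both conversions can be sketched in plain Python, using the 2 x 2 matrix S from the worked example (variable names are mine):

```python
import math

# Convert a covariance matrix to a correlation matrix and back,
# element by element: r_ij = s_ij / (s_i * s_j), and s_ij = r_ij * s_i * s_j.
S = [[6.2, -7.55], [-7.55, 20.2]]
sd = [math.sqrt(S[i][i]) for i in range(2)]   # standard deviations

# Covariance -> correlation.
R = [[S[i][j] / (sd[i] * sd[j]) for j in range(2)] for i in range(2)]

# Correlation -> covariance (recovers S).
S_back = [[R[i][j] * sd[i] * sd[j] for j in range(2)] for i in range(2)]

print(round(R[0][1], 3))  # -0.675
```

Note that the diagonal of R comes out as 1, since each diagonal covariance is a variance divided by itself.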
2.2 Correlation and Causation
Correlation does not imply causation!
In SEM, you specify one or more models.
- This model represents your hypotheses about why the variables in your model are related
- high anxiety causes low achievement (causal)
- anxiety and achievement are spuriously related due to their relationships with self-efficacy (non-causal)
- So, you (the researcher) make hypotheses about causal relationships and noncausal relationships.
A hypothesized model that reproduces the original correlations well (fits) does not mean one has established causality. Failure to reject an SEM model does not prove that it is correct. Establishing causality is related to when and how the data were collected—design issues, not the type of analysis. To reasonably infer that X is a cause of Y, all the following conditions must be met:
- There is time precedence; that is, X precedes Y in time.
- The direction of the causal relation is correctly specified; that is, X causes Y, rather than Y causing X or X and Y causing each other (the latter is called reciprocal causation).
- The relation between X and Y does not disappear when external variables such as common causes of both are held constant (partialed out).
2.3 Data Issues
2.3.1 What form should the data be in?
In order to use an SEM program (AMOS, LISREL, Mplus, EQS, R lavaan, etc.), the data may begin in raw or matrix form.
Raw data should be used when the data are nonnormal and an estimation technique that takes this into account is used.
The matrices these programs can use are correlation or covariance matrices. The diagonal of a covariance matrix contains the variance of each variable. Remember, the covariance between 2 variables equals:
\(Co{v_{xy}} = {r_{xy}}(S{D_x})(S{D_y})\)
\({r_{xy}} = \frac{{Co{v_{xy}}}}{{(S{D_x})(S{D_y})}}\)
The diagonal of a correlation matrix equals 1 because the variables are standardized.
2.3.2 Covariance matrix or correlation matrix?
For SEM, analyzing a covariance matrix instead of a correlation matrix is advised. This is because many estimation methods assume you are using unstandardized variables. Basically, standard inferential statistical theory applies only to the analysis of a covariance matrix—not correlation. The tests of significance may be incorrect if a correlation matrix is used.
If you have a correlation matrix and the standard deviations of the variables, the programs can produce the covariance matrix.
2.3.3 Missing Data
2.3.3.1 Available Case Methods
If you choose to delete subjects with missing data, make sure that you do NOT use pairwise deletion when creating the covariance matrix. Listwise deletion (each subject has a score on all variables) is a better choice than pairwise deletion due to statistical issues. With pairwise deletion, the covariances can be based on different numbers of people, which can produce out-of-range values, which in turn can lead to a non-positive definite matrix or a matrix whose determinant is zero. Since SEM programs analyze matrices, either problem can cause the program to “crash” or produce uninterpretable results.
Specifically, the determinant of a matrix represents the overall variance of the matrix. The determinant is calculated whenever one needs to divide in matrix algebra (i.e., invert a matrix). One cannot invert a matrix whose determinant equals zero (or is close to zero), because inversion requires dividing by the determinant.
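For a 2 x 2 matrix this is easy to see in a sketch: the determinant is \(ad - bc\), and every element of the inverse is divided by it (matrix values below are illustrative; the second one is deliberately singular):

```python
# Determinant and inverse of a 2 x 2 matrix; a zero determinant
# makes inversion impossible because it would require dividing by zero.
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    if d == 0:
        raise ZeroDivisionError("singular matrix: determinant is zero")
    return [[ M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d,  M[0][0] / d]]

S_ok = [[6.2, -7.55], [-7.55, 20.2]]
print(round(det2(S_ok), 4))   # positive, so S_ok is invertible

# If two variables are perfectly correlated, cov = sd_i * sd_j,
# so ad - bc = 0 and the matrix cannot be inverted.
S_singular = [[4.0, 6.0], [6.0, 9.0]]
print(det2(S_singular))       # 0.0
```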
2.3.3.2 Imputation Methods
If listwise deletion leads to a large decrease in the sample size, consider the option of imputation of missing values. This means you are substituting estimated values for the missing values. This can be done in general-purpose statistical software such as SPSS, SAS, and R.
2.3.3.3 Model-based Missing Data Techniques
Missing values can be imputed using other variables. These techniques include the EM (expectation-maximization) algorithm and multiple imputation.
2.3.3.4 Full Information Maximum Likelihood
This method does not involve imputing/substituting missing values. Information from both complete data points and partial data points is used for model statistics.
2.3.4 Multicollinearity
When variables are highly correlated, this can lead to a non-positive definite matrix. Extreme multicollinearity is seen if there is a singularity problem. Singularity means one of the variables is a linear combination of two or more of the other variables.
The regression procedure can be used to assess multivariate multicollinearity. You want high tolerance (e.g., > .10) and a low Variance Inflation Factor (VIF; e.g., < 10).
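In the simplest case of a single other predictor, the squared multiple correlation \(R_j^2\) is just the squared bivariate correlation, so the two diagnostics reduce to a couple of lines (the correlation value below is hypothetical, chosen to sit right at the rule-of-thumb cutoffs):

```python
# Tolerance and VIF sketch for the one-other-predictor case:
# tolerance = 1 - R^2_j, VIF = 1 / tolerance.
r = 0.95                     # hypothetical correlation between two predictors
R2 = r ** 2                  # R^2 from regressing one predictor on the other
tolerance = 1 - R2           # want this high (e.g., > .10)
vif = 1 / tolerance          # want this low (e.g., < 10)
print(round(tolerance, 4), round(vif, 2))  # 0.0975 10.26 -> flags a problem
```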
2.3.4.1 Outliers
One method of checking for univariate outliers is to look at the item’s frequency distribution and plot a histogram. Outliers may simply be due to data entry errors, which can be fixed. Always check this first.
After checking for univariate outliers, check for multivariate outliers. Some do this by looking at bivariate scatterplots; however, this can be very time-consuming if there are several observed variables, and it does not examine whether the case is a multivariate outlier based on all the variables.
Mahalanobis distance can also be used to identify subjects who are multivariate outliers. This can be done via the regression procedure.
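A sketch of the computation, applied to the 5 x 2 example data from earlier in the chapter: each case's squared distance is \(D^2 = \mathbf{d}'\mathbf{S}^{-1}\mathbf{d}\), where \(\mathbf{d}\) is that case's vector of deviation scores (variable names are mine):

```python
# Squared Mahalanobis distance for each case in the 5 x 2 example.
X = [[1, 3], [4, -5], [1, 7], [3, 2], [7, -1]]
n = len(X)
means = [sum(row[j] for row in X) / n for j in range(2)]
Xd = [[row[j] - means[j] for j in range(2)] for row in X]

# Sample covariance matrix S and its 2 x 2 inverse.
S = [[sum(r[i] * r[j] for r in Xd) / (n - 1) for j in range(2)]
     for i in range(2)]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det,  S[0][0] / det]]

def mahalanobis_sq(d):
    # d' Sinv d for a 2-element deviation vector
    return sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))

D2 = [mahalanobis_sq(d) for d in Xd]
print([round(v, 2) for v in D2])  # larger values flag potential outliers
```

A useful check on the arithmetic: when the sample mean and the (n - 1)-denominator covariance matrix are used, the squared distances sum to \((n-1)p\), here \(4 \times 2 = 8\).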
2.3.4.2 Normality
Variables should be normally distributed when using a model-fitting estimation procedure such as ML (this is not necessary for OLS or some other estimation techniques). The effect of nonnormality on ML-based results depends on its extent: the greater the nonnormality, the greater the impact on the results.
Both univariate normality and multivariate normality need to be investigated just as one has to do with outliers.
It’s hard to detect multivariate normality. Fortunately, there are estimation methods in SEM that do not require normality.
2.3.4.3 Homoscedasticity
Homoscedasticity means that the variability in scores for one variable is the same across the values of another variable. When multivariate normality is met, homoscedasticity is not violated. Heteroscedasticity is a problem because it decreases the predictability of one variable from another, since the relationship is not consistent across all values.
2.3.4.4 Linearity
For most of this class, we will deal only with variables that have linear relationships. We will briefly introduce methods for nonlinear relationships (e.g., when variables are ordered categorical instead of continuous).