Solution ID: 100000316
Question Subtype: Statistical Distributions
Title:
Generating MVN data with a specific covariance matrix
Description:
Q.
How can I generate data which are multivariate normal and
have a covariance or correlation matrix that I specify?
A.
In SPSS 4.0 or above on non-DOS platforms, this job can be performed
with the matrix language of SPSS. Users of SPSS/PC+ or SPSSX can do
part of the job but will have to perform some matrix algebra outside
of SPSS. The SPSS 4.0 approach is described first, followed by a
description of the SPSSX approach. The general algorithm is described
in Rubinstein, R. (1981), "Simulation and the Monte Carlo Method",
Wiley.
I SPSS 4.0 Method
The basic strategy is:
1. Before entering matrix, generate the required number of normally-
distributed variables by using compute commands. These variables should
be multivariate normal (MVN) with correlations approaching 0,
i.e. approaching independence.
2. To compute a set of variables which are multivariate normal and
strictly independent, or orthogonal, run the FACTOR procedure
with the variables created in Step 1. Use the default extraction
method of principal components, extracting as many components
as you included in the analysis. Save these components to a
new set of variables with the FACTOR /SAVE subcommand. These
new variables will be MVN and orthogonal with means of 0 and
variances of 1.
3. Enter the matrix language and read the set of standard normal variables
from Step 2 as a matrix, Z for example. Then define and enter the
target covariance matrix, S.
4. Calculate the Cholesky factor for the target covariance matrix.
The Cholesky factor is an upper triangular matrix which is the
"square root" of the covariance matrix. If V is the Cholesky
factor of matrix S, and VT is the transpose of V, then
VT*V=S , where * is the matrix multiplication operator.
5. Post-multiply the independent standard normals, Z, by the Cholesky
factor, V, to give a new data matrix, X. If the target matrix was
a correlation matrix and you want the new variables to have a
standard deviation other than 1, then multiply X by the desired SD.
If you want the new variables to have a nonzero mean, add the desired
mean to X (but not before multiplication by SD).
If the target covariance matrix is not a correlation matrix, the
new variables will have variances equal to their respective
diagonal elements in the target matrix. Their means will be 0.
The desired mean can be added to the variables with a compute
statement, either before or after leaving matrix.
6. Save the X matrix to variables in the SPSS active file and leave
matrix.
An example of the implementation of this algorithm is shown below. The
target matrix is a correlation matrix for the five variables to be
generated. I added some statements to print the determinant and the
condition number of the input covariance matrix to insure that it was
not singular or ill-conditioned. I also printed the product of VT*V to
insure that the target matrix was reproduced. All the new variables
were scaled to have means of 100 and standard deviations of 15 before
leaving matrix.
input program.
+ loop #i = 1 to 10000.
+ do repeat response=r1 to r5 .
+ compute response = normal(1) .
+ end repeat.
+ end case.
+ end loop.
+ end file.
end input program.
correlations r1 to r5 / statistics=descriptives.
* Factor procedure computes pr1 to pr5, which are standard MVN .
factor variables = r1 to r5 / print = default det
/criteria = factors(5) /save=reg (all,pr).
* use matrix to set corr matrix.
* x is a 10,000 by 5 matrix of independent standard normals .
* cor is the target covariance matrix.
* cho is the Cholesky factor of cor .
* newx is the 10,000 by 5 data matrix which has target covariance matrix .
matrix.
get x / variables=pr1 to pr5.
compute cor={1, 0.4, 0.3, 0.2, 0.1 ;
0.4, 1, 0.4, 0.3, 0.2 ;
0.3, 0.4, 1, 0.4, 0.3 ;
0.2, 0.3, 0.4, 1, 0.4 ;
0.1, 0.2, 0.3, 0.4, 1 }.
compute deter=det(cor).
print deter / title "determinant of corr matrix" / format=f10.7 .
print sval(cor) / title "singular value decomposition of corr".
print eval(cor) / title "eigenvalues of input corr".
* In a symmetric matrix sval and eigenvalues are identical - choose 1 .
compute condnum=mmax(sval(cor))/mmin(sval(cor)).
print condnum / title "condition number of corr matrix" / format=f10.2 .
compute cho=chol(cor).
print cho / title "cholesky factor of corr matrix" .
compute chochek=t(cho)*cho.
print chochek / title "chol factor premult by its transpose " /format=f10.2 .
compute newx=x*cho.
compute newx=newx*15 + 100.
save newx /outfile=* /variables= nr1 to nr5.
end matrix.
correlations nr1 to nr5 / statistics=descriptives.
II SPSSX and SPSS/PC+ Method
If the matrix language is not available, users can apply the Cholesky
factor matrix to the standard normal variables with a series of
compute commands. However, the Cholesky factorization must be
calculated outside of SPSS. The formulae for the Cholesky are
provided in Rubinstein's book and in Bock (1975), "Multivariate
Statistical Methods in Behavioral Research", McGraw-Hill.
The example below uses the Cholesky factor for the target matrix in
the previous example. The Cholesky factor matrix is printed below.
One compute statement is required for each new variable. Note that the
values in the columns of the Cholesky matrix act as weights for the
original standard normal variables and these weighted variables are
combined to form the new variables.
Cholesky Matrix for the Target Correlation Matrix.
1.0 0.4 0.3 0.2 0.1
0 0.91652 0.30551 0.24004 0.17457
0 0 0.9037 0.29508 0.23976
0 0 0 0.90294 0.29608
0 0 0 0 0.90243
* SPSS-X, SPSS or SPSS/PC+ program - assume an active file exists with
five independent random vars created by compute r1=normal(1).
correlations r1 to r5 / statistics=descriptives.
* This correlation matrix should be nearly an identity matrix, 1's in
the diagonal and 0's elsewhere.
factor variables = r1 to r5 / print = default det
/criteria = factors(5) /save=reg (all,pr).
correlations pr1 to pr5 / statistics=descriptives.
* This correlation matrix should be an identity matrix.
* use series of computes to post-mult data matrix by
a Cholesky matrix calculated outside spss .
compute nr1 = pr1.
compute nr2 = 0.4*pr1 + 0.91652*pr2 .
compute nr3 = 0.3*pr1 + 0.30551*pr2 + 0.9037*pr3.
compute nr4 = 0.2*pr1 + 0.24004*pr2 + 0.29508*pr3 + 0.90294*pr4.
compute nr5 = 0.1*pr1 + 0.17457*pr2 + 0.23976*pr3 + 0.29608*pr4 + 0.90243*pr5.
correlations nr1 to nr5 / statistics=descriptives.
* This correlation matrix should be like the target correlation matrix.