PCA Visualization in MATLAB®
How to do PCA Visualization in MATLAB® with Plotly.
Principal Components of a Data Set
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Find the principal components for the ingredients data.
load hald
coeff = pca(ingredients)
coeff =

  Columns 1 through 3

  -0.067799985695474  -0.646018286568728   0.567314540990512
  -0.678516235418647  -0.019993340484099  -0.543969276583817
   0.029020832106229   0.755309622491133   0.403553469172668
   0.730873909451461  -0.108480477171676  -0.468397518388289

  Column 4

   0.506179559977705
   0.493268092159297
   0.515567418476836
   0.484416225289198
The rows of coeff contain the coefficients for the four ingredient variables, and its columns correspond to four principal components.
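As a quick sanity check (a minimal sketch reusing the variables above), you can verify that the columns of coeff are orthonormal:
load hald
coeff = pca(ingredients);
coeff'*coeff   % should be (numerically) the 4-by-4 identity matrix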
PCA in the Presence of Missing Data
Find the principal component coefficients when there are missing values in a data set.
Load the sample data set.
load imports-85
Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.
Perform principal component analysis.
load imports-85
coeff = pca(X(:,3:15));
By default, pca performs the action specified by the 'Rows','complete' name-value pair argument. This option removes the observations with NaN values before calculation. Rows of NaNs are reinserted into score and tsquared at the corresponding locations, namely rows 56 to 59, 131, and 132.
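To confirm where the rows of NaNs are reinserted, you can inspect score for NaN rows (a minimal sketch reusing the variables above):
load imports-85
[coeff,score,~,tsquared] = pca(X(:,3:15));
find(any(isnan(score),2))'   % expected: 56 57 58 59 131 132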
Use 'pairwise' to perform the principal component analysis.
load imports-85
coeff = pca(X(:,3:15),'Rows','pairwise');
In this case, pca computes the (i,j) element of the covariance matrix using the rows with no NaN values in columns i or j of X. Note that the resulting covariance matrix might not be positive definite.
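For intuition, here is a minimal sketch of how one pairwise covariance entry is formed; the column indices below (bore and stroke within X(:,3:15)) are chosen only for illustration:
load imports-85
V = X(:,3:15);
i = 7; j = 8;                          % bore and stroke, for illustration
ok = ~isnan(V(:,i)) & ~isnan(V(:,j));  % rows with no NaN in column i or j
C = cov(V(ok,i),V(ok,j));              % 2-by-2 covariance from those rows
C(1,2)                                 % the (i,j) covariance element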
This option applies when the algorithm pca uses is eigenvalue decomposition. When you don't specify the algorithm, as in this example, pca sets it to 'eig'.
If you require 'svd' as the algorithm, with the 'pairwise' option, then pca returns a warning message, sets the algorithm to 'eig', and continues.
If you use the 'Rows','all' name-value pair argument, pca terminates because this option assumes there are no missing values in the data set.
load imports-85
coeff = pca(X(:,3:15),'Rows','all');
...ERROR...
Raw data contains NaN missing value while 'Rows' option is set to 'all'. Consider using 'complete' or 'pairwise' option, or using 'als' algorithm instead.
Weighted PCA
Use the inverse variable variances as weights while performing the principal component analysis.
Load the sample data set.
load hald
Perform the principal component analysis using the inverse of variances of the ingredients as variable weights.
load hald
[wcoeff,~,latent,~,explained] = pca(ingredients,'VariableWeights','variance');
Note that the coefficient matrix, wcoeff, is not orthonormal.
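You can see this by checking wcoeff'*wcoeff, which is not the identity matrix (a quick sketch reusing the variables above):
load hald
[wcoeff,~,latent,~,explained] = pca(ingredients,'VariableWeights','variance');
wcoeff'*wcoeff   % not the identity, so the columns of wcoeff are not orthonormal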
Calculate the orthonormal coefficient matrix.
load hald
[wcoeff,~,latent,~,explained] = pca(ingredients,'VariableWeights','variance');
coefforth = inv(diag(std(ingredients)))*wcoeff
coefforth =

  Columns 1 through 3

  -0.475955172748972   0.508979384806409  -0.675500187964285
  -0.563870242191993  -0.413931487136986   0.314420442819292
   0.394066533909305  -0.604969078471438  -0.637691091806566
   0.547931191260862   0.451235109330017   0.195420962611708

  Column 4

   0.241052184051094
   0.641756074427214
   0.268466110294533
   0.676734019481283
Check orthonormality of the new coefficient matrix, coefforth.
load hald
[wcoeff,~,latent,~,explained] = pca(ingredients,'VariableWeights','variance');
coefforth = inv(diag(std(ingredients)))*wcoeff;
coefforth*coefforth'
ans =

  Columns 1 through 3

   1.000000000000000   0.000000000000000  -0.000000000000000
   0.000000000000000   1.000000000000000   0.000000000000000
  -0.000000000000000   0.000000000000000   1.000000000000000
  -0.000000000000000   0.000000000000000  -0.000000000000000

  Column 4

  -0.000000000000000
   0.000000000000000
  -0.000000000000000
   1.000000000000000
PCA Using ALS for Missing Data
Find the principal components using the alternating least squares (ALS) algorithm when there are missing values in the data.
Load the sample data.
load hald
The ingredients data has 13 observations for 4 variables.
Perform principal component analysis on the complete data and display the component coefficients.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
coeff
coeff =

  Columns 1 through 3

  -0.067799985695474  -0.646018286568728   0.567314540990512
  -0.678516235418647  -0.019993340484099  -0.543969276583817
   0.029020832106229   0.755309622491133   0.403553469172668
   0.730873909451461  -0.108480477171676  -0.468397518388289

  Column 4

   0.506179559977705
   0.493268092159297
   0.515567418476836
   0.484416225289198
Introduce missing values randomly.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN
y =

     7    26     6   NaN
     1    29    15    52
   NaN   NaN     8    20
    11    31   NaN    47
     7    52     6    33
   NaN    55   NaN   NaN
   NaN    71   NaN     6
     1    31   NaN    44
     2   NaN   NaN    22
    21    47     4    26
   NaN    40    23    34
    11    66     9   NaN
    10    68     8    12
Approximately 30% of the data has missing values now, indicated by NaN.
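You can verify the realized fraction of missing entries (a minimal sketch reusing the variables above):
load hald
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
mean(isnan(y(:)))   % fraction of missing entries, roughly 0.30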
Perform principal component analysis using the ALS algorithm and display the component coefficients.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,'algorithm','als');
coeff1
coeff1 =

  Columns 1 through 3

  -0.036212338541055   0.821521833463707  -0.525180807034516
  -0.683141604556494  -0.099847237932408   0.182836485390674
   0.016929757180232   0.557510085824824   0.821489828858883
   0.729191057256728  -0.065687977768361   0.126136436504135

  Column 4

   0.219033475985740
   0.699942066753560
  -0.118534166411284
   0.669369173908958
Display the estimated mean.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,'algorithm','als');
mu1
mu1 =

  Columns 1 through 3

   8.995597151700496  47.908800869772193   9.045133815910839

  Column 4

  28.551458070546648
Reconstruct the observed data.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,'algorithm','als');
t = score1*coeff1' + repmat(mu1,13,1)
t =

  Columns 1 through 3

   6.999999999999997  25.999999999999972   5.999999999999996
   0.999999999999986  28.999999999999968  15.000000000000011
  10.781877383662085  53.023021922864480   7.999999999999999
  11.000000000000002  30.999999999999996  13.549987432049472
   6.999999999999996  52.000000000000000   6.000000000000012
  10.481832029571297  55.000000000000000   7.832799509066273
   3.098161701207570  71.000000000000014  11.949103806476206
   0.999999999999980  30.999999999999968  -0.516112356202619
   1.999999999999988  53.791389384174103   5.770961215451544
  21.000000000000014  47.000000000000000   3.999999999999998
  21.580891857665534  39.999999999999986  23.000000000000014
  10.999999999999996  66.000000000000014   9.000000000000002
   9.999999999999998  68.000000000000000   8.000000000000009

  Column 4

  51.524967145094379
  52.000000000000021
  19.999999999999996
  47.000000000000014
  32.999999999999950
  17.936171158236188
   5.999999999999915
  44.000000000000000
  22.000000000000000
  25.999999999999989
  33.999999999999993
   5.707816613776032
  11.999999999999943
The ALS algorithm estimates the missing values in the data.
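One way to see this is to compare t with y at the observed entries, where the two should agree to machine precision (a minimal sketch reusing the variables above):
load hald
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,'algorithm','als');
t = score1*coeff1' + repmat(mu1,13,1);
max(abs(t(~ix) - y(~ix)))   % near machine precision at the observed entries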
Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors. Find the angle between the coefficients found for complete data and data with missing values using ALS.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,'algorithm','als');
t = score1*coeff1' + repmat(mu1,13,1);
subspace(coeff,coeff1)
ans = 6.872983007538972e-16
This value is small. It indicates that the results from pca with the 'Rows','complete' name-value pair argument when there is no missing data and from pca with the 'algorithm','als' name-value pair argument when there is missing data are close to each other.
Perform the principal component analysis using the 'Rows','complete' name-value pair argument and display the component coefficients.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff2,score2,latent,tsquared,explained,mu2] = pca(y,'Rows','complete');
coeff2
coeff2 =

  -0.205420225953307   0.858688026698708   0.049181247828161
  -0.669364063599413  -0.371998098041251   0.551021387596589
   0.147356319504436  -0.351257041216125  -0.518714827525170
   0.698598880784301  -0.029845918549515   0.651837067815822
In this case, pca removes the rows with missing values, and y has only four rows with no missing values. pca returns only three principal components. You cannot use the 'Rows','pairwise' option because the covariance matrix is not positive semidefinite and pca returns an error message.
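A quick sketch confirms the counts: only four rows of y are complete, so pca can return at most three components.
load hald
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
sum(all(~isnan(y),2))            % number of complete rows: 4
size(pca(y,'Rows','complete'))   % coefficient matrix is 4-by-3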
Find the angle between the coefficients found for complete data and data with missing values using listwise deletion (when 'Rows','complete').
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff2,score2,latent,tsquared,explained,mu2] = pca(y,'Rows','complete');
subspace(coeff(:,1:3),coeff2)
ans = 0.357607357590711
The angle between the two spaces is substantially larger. This indicates that these two results are different.
Display the estimated mean.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff2,score2,latent,tsquared,explained,mu2] = pca(y,'Rows','complete');
mu2
mu2 =

  Columns 1 through 3

   7.888888888888889  46.909090909090907   9.875000000000000

  Column 4

  29.600000000000001
In this case, the mean is just the sample mean of y.
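You can confirm this by computing the column means of y while ignoring the NaNs (a minimal sketch; the 'omitnan' flag for mean requires R2015a or later):
load hald
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
mean(y,'omitnan')   % matches mu2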
Reconstruct the observed data.
load hald
[coeff,score,latent,tsquared,explained] = pca(ingredients);
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN;
[coeff2,score2,latent,tsquared,explained,mu2] = pca(y,'Rows','complete');
score2*coeff2'
ans =

  Columns 1 through 3

                 NaN                 NaN                 NaN
  -7.516185143683813 -18.354534728448066   4.096758018165185
                 NaN                 NaN                 NaN
                 NaN                 NaN                 NaN
  -0.564429969715416   5.321307756733086  -3.343158339022808
                 NaN                 NaN                 NaN
                 NaN                 NaN                 NaN
                 NaN                 NaN                 NaN
                 NaN                 NaN                 NaN
  12.831514933168062  -0.107632488812891  -6.333304231731043
                 NaN                 NaN                 NaN
                 NaN                 NaN                 NaN
   1.467986895710465  20.634225817831361  -2.929186618770285

  Column 4

                 NaN
  22.005631390194360
                 NaN
                 NaN
   3.603980833482452
                 NaN
                 NaN
                 NaN
                 NaN
  -3.775776525301393
                 NaN
                 NaN
 -18.004319332087810
This shows that deleting rows containing NaN values does not work as well as the ALS algorithm. Using ALS is better when the data has too many missing values.
Principal Component Coefficients, Scores, and Variances
Find the coefficients, scores, and variances of the principal components.
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Find the principal component coefficients, scores, and variances of the components for the ingredients data.
load hald
[coeff,score,latent] = pca(ingredients);
Each column of score corresponds to one principal component. The vector, latent, stores the variances of the four principal components.
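As a quick check (a sketch reusing the variables above), the entries of latent equal the variances of the columns of score:
load hald
[coeff,score,latent] = pca(ingredients);
[var(score)' latent]   % the two columns match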
Reconstruct the centered ingredients data.
load hald
[coeff,score,latent] = pca(ingredients);
Xcentered = score*coeff'
Xcentered =

  Columns 1 through 3

  -0.461538461538459 -22.153846153846157  -5.769230769230774
  -6.461538461538464 -19.153846153846153   3.230769230769226
   3.538461538461541   7.846153846153848  -3.769230769230770
   3.538461538461538 -17.153846153846143  -3.769230769230777
  -0.461538461538459   3.846153846153846  -5.769230769230769
   3.538461538461541   6.846153846153848  -2.769230769230769
  -4.461538461538460  22.846153846153818   5.230769230769237
  -6.461538461538467 -17.153846153846153  10.230769230769228
  -5.461538461538463   5.846153846153832   6.230769230769233
  13.538461538461540  -1.153846153846135  -7.769230769230773
  -6.461538461538466  -8.153846153846160  11.230769230769232
   3.538461538461542  17.846153846153840  -2.769230769230766
   2.538461538461542  19.846153846153840  -3.769230769230764

  Column 4

  30.000000000000021
  22.000000000000021
 -10.000000000000000
  17.000000000000000
   3.000000000000000
  -8.000000000000005
 -23.999999999999996
  14.000000000000007
  -7.999999999999992
  -4.000000000000015
   4.000000000000006
 -18.000000000000007
 -18.000000000000000
The new data in Xcentered is the original ingredients data centered by subtracting the column means from corresponding columns.
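Equivalently, you can center the data directly and compare (a minimal sketch reusing the variables above):
load hald
[coeff,score,latent] = pca(ingredients);
Xcentered = score*coeff';
C = ingredients - repmat(mean(ingredients),13,1);  % subtract column means
max(max(abs(Xcentered - C)))                       % near machine precision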
Visualize both the orthonormal principal component coefficients for each variable and the principal component scores for each observation in a single plot.
load hald
[coeff,score,latent] = pca(ingredients);
Xcentered = score*coeff';
biplot(coeff(:,1:2),'scores',score(:,1:2),'varlabels',{'v_1','v_2','v_3','v_4'});
fig2plotly(gcf);
All four variables are represented in this biplot by a vector, and the direction and length of the vector indicate how each variable contributes to the two principal components in the plot. For example, the first principal component, which is on the horizontal axis, has positive coefficients for the third and fourth variables. Therefore, vectors v3 and v4 are directed into the right half of the plot. The largest coefficient in the first principal component is the fourth, corresponding to the variable v4.
The second principal component, which is on the vertical axis, has negative coefficients for the variables v1, v2, and v4, and a positive coefficient for the variable v3.
This 2-D biplot also includes a point for each of the 13 observations, with coordinates indicating the score of each observation for the two principal components in the plot. For example, points near the left edge of the plot have the lowest scores for the first principal component. The points are scaled with respect to the maximum score value and maximum coefficient length, so only their relative locations can be determined from the plot.
T-Squared Statistic
Find the Hotelling’s T-squared statistic values.
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Perform the principal component analysis and request the T-squared values.
load hald
[coeff,score,latent,tsquared] = pca(ingredients);
tsquared
tsquared =

   5.680340841490951
   3.075837035211966
   6.000232793912613
   2.619763232047952
   3.368139445336655
   0.566796674771214
   3.481840731756962
   3.979397901620163
   2.608586242833568
   7.481756329327637
   4.183022234152593
   2.232718720505654
   2.721567817032093
Request only the first two principal components and compute the T-squared values in the reduced space of requested principal components.
load hald
[coeff,score,latent,tsquared] = pca(ingredients,'NumComponents',2);
tsquared
tsquared =

   5.680340841490951
   3.075837035211966
   6.000232793912613
   2.619763232047952
   3.368139445336655
   0.566796674771214
   3.481840731756962
   3.979397901620163
   2.608586242833568
   7.481756329327637
   4.183022234152593
   2.232718720505654
   2.721567817032093
Note that even when you specify a reduced component space, pca computes the T-squared values in the full space, using all four components.
The T-squared value in the reduced space corresponds to the Mahalanobis distance in the reduced space.
load hald
[coeff,score,latent,tsquared] = pca(ingredients,'NumComponents',2);
tsqreduced = mahal(score,score)
tsqreduced =

   3.317920809184487
   2.007906801224291
   0.587427300774747
   1.738161538935849
   0.295525889978777
   0.422793560910442
   3.245670063829735
   2.691429716874080
   1.361859279884830
   2.990295498144076
   2.437109148953179
   1.378818611291408
   1.525081780014097
Calculate the T-squared values in the discarded space by taking the difference of the T-squared values in the full space and Mahalanobis distance in the reduced space.
load hald
[coeff,score,latent,tsquared] = pca(ingredients,'NumComponents',2);
tsqreduced = mahal(score,score);
tsqdiscarded = tsquared - tsqreduced
tsqdiscarded =

   2.362420032306464
   1.067930233987675
   5.412805493137867
   0.881601693112104
   3.072613555357878
   0.144003113860771
   0.236170667927227
   1.287968184746083
   1.246726962948738
   4.491460831183561
   1.745913085199414
   0.853900109214246
   1.196486037017995
Percent Variability Explained by Principal Components
Find the percent variability explained by the principal components. Show the data representation in the principal components space.
Load the sample data set.
load imports-85
Data matrix X
has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg.
Find the percent variability explained by principal components of these variables.
load imports-85
[coeff,score,latent,tsquared,explained] = pca(X(:,3:15));
explained
explained =

  64.342913356114238
  35.448392662688399
   0.154993220067715
   0.037863222808984
   0.007807396888358
   0.004771746794250
   0.001304050511812
   0.001056438605476
   0.000529182750646
   0.000186613509121
   0.000158017147436
   0.000017269286049
   0.000006822827515
The first three components explain 99.95% of all variability.
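A quick check of that figure (a sketch reusing the variables above):
load imports-85
[coeff,score,latent,tsquared,explained] = pca(X(:,3:15));
sum(explained(1:3))   % approximately 99.95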
Visualize the data representation in the space of the first three principal components.
load imports-85
[coeff,score,latent,tsquared,explained] = pca(X(:,3:15));
scatter3(score(:,1),score(:,2),score(:,3))
axis equal
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
zlabel('3rd Principal Component')
fig2plotly(gcf);
The data shows the largest variability along the first principal component axis. This is the largest possible variance among all possible choices of the first axis. The variability along the second principal component axis is the largest among all possible remaining choices of the second axis. The third principal component axis has the third largest variability, which is significantly smaller than the variability along the second principal component axis. The fourth through thirteenth principal component axes are not worth inspecting, because they explain only 0.05% of all variability in the data.
To skip any of the outputs, you can use ~ instead in the corresponding element. For example, if you don't want to get the T-squared values, specify
load imports-85
[coeff,score,latent,~,explained] = pca(X(:,3:15));