Multivariate Statistical Analysis Classification

In Multivariate Statistical Analysis (MSA) classification, the full set of particles is reduced to a new lower-dimensional representation by means of Eigenvalue decomposition. Particles projected onto the most variable basis vectors can then be clustered using a variety of methods.

Within subTOM, particles are first compiled into a 2-D matrix, denoted here as the X-Matrix, which holds the aligned, band-pass filtered and masked particle data. To speed up calculation, particles can be pre-aligned using the function subtom_parallel_prealign. Batches of the X-Matrix are calculated in parallel with subtom_parallel_xmatrix_msa and then combined and column-centered with subtom_join_xmatrix.
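
As a rough illustration of what compiling and column-centering such a matrix involves, here is a minimal NumPy sketch. It is not subTOM code: the function name build_xmatrix, the toy data, and the choice of one row per particle and one column per masked voxel are all assumptions made for the example, and alignment and band-pass filtering are taken as already done.

```python
import numpy as np

def build_xmatrix(particles, mask):
    """Hypothetical helper: stack masked voxels row-wise, then column-center."""
    mask = mask.astype(bool)
    # One row per particle, one column per voxel inside the mask (assumed layout).
    x = np.stack([p[mask] for p in particles]).astype(np.float64)
    # Column-centering: subtract each voxel's mean taken over all particles.
    x -= x.mean(axis=0, keepdims=True)
    return x

rng = np.random.default_rng(0)
particles = rng.normal(size=(6, 4, 4, 4))  # 6 toy "particles" of 4x4x4 voxels
mask = np.zeros((4, 4, 4))
mask[1:3, 1:3, 1:3] = 1                    # toy cubic mask covering 8 voxels
xmatrix = build_xmatrix(particles, mask)
print(xmatrix.shape)
```

Because each row depends only on its own particle, rows can be computed in independent batches and stacked afterwards; the centering step needs all rows and so has to happen after the batches are joined.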

Next, the X-Matrix is used to calculate the covariance matrix, which is scaled using the so-called ‘modulation metric’ described by L. Borland and M. van Heel in J. Opt. Soc. Am. A 1990, and which is similar to the Chi-Square metrics used in Correspondence Analysis of ordinal data. This covariance matrix is then decomposed into its Eigenvectors and Eigenvalues, and these are used along with the X-Matrix to determine the Eigenvolumes of the dataset with subtom_eigenvolumes_msa.
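
The decomposition step can be sketched as follows. This is a simplified illustration, not the subTOM implementation: it uses a plain, unscaled covariance matrix and omits the modulation-metric scaling entirely. The function name eigenvolumes_sketch, the particle-by-particle form of the covariance matrix, and the row-per-particle layout are assumptions.

```python
import numpy as np

def eigenvolumes_sketch(x, num_eig):
    """Hypothetical helper: leading eigenvolumes of a column-centered matrix x.

    x has shape (n_particles, n_voxels). No modulation-metric scaling is applied.
    """
    n = x.shape[0]
    # Particle-by-particle covariance (n x n), cheaper than voxel-by-voxel
    # when there are far more voxels than particles.
    cov = x @ x.T / (n - 1)
    evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(evals)[::-1][:num_eig]   # keep the largest num_eig
    evals, evecs = evals[order], evecs[:, order]
    # Map the eigenvectors back to voxel space using the X-matrix.
    vols = evecs.T @ x
    vols /= np.linalg.norm(vols, axis=1, keepdims=True)
    return evals, vols

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
x -= x.mean(axis=0, keepdims=True)              # column-center the toy data
evals, eigvols = eigenvolumes_sketch(x, num_eig=3)
print(evals.shape, eigvols.shape)
```

Each row of eigvols could be reshaped back into the masked volume for inspection. Only components with a non-zero Eigenvalue should be requested, since the normalisation divides by the norm of each back-projected vector.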

These volumes are then used to determine the low-rank approximation coefficients in volume space for clustering. A larger particle superset can be projected onto the volumes to speed up classification of large datasets. Coefficients are also calculated in parallel in batches with subtom_parallel_eigcoeffs_msa and joined with subtom_join_eigencoeffs_msa.
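
A minimal sketch of the projection step, assuming the Eigenvolumes are stored as orthonormal rows and that every particle has been preprocessed in the same way as the data used to compute them. The function name eigencoeffs_sketch and the toy data are hypothetical. The batching loop only mimics the idea of computing coefficients in independent batches and joining them afterwards.

```python
import numpy as np

def eigencoeffs_sketch(particles_flat, eigvols):
    """Hypothetical helper: project particles (rows) onto eigenvolumes (rows)."""
    return particles_flat @ eigvols.T

rng = np.random.default_rng(2)
# Toy orthonormal basis: 3 "eigenvolumes" of 8 voxels each.
q, _ = np.linalg.qr(rng.normal(size=(8, 3)))
eigvols = q.T
# Toy superset of 10 particles; it may be larger than the set used for the basis.
superset = rng.normal(size=(10, 8))

# Each batch is independent, so batches could run in parallel.
batches = np.array_split(superset, 4)
coeffs = np.vstack([eigencoeffs_sketch(b, eigvols) for b in batches])
print(coeffs.shape)
```

Since the projection is a per-particle dot product, the joined batch result is identical to projecting the whole superset at once.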

Finally, using a user-selected subset of the determined coefficients, the data is clustered with the function subtom_cluster by one of three methods: Hierarchical Ascendant Clustering using a Ward distance criterion, K-Means clustering, or a Gaussian Mixture Model. This clustering is then used to generate the final class averages.
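
To illustrate clustering on a subset of coefficients, the sketch below uses SciPy rather than subtom_cluster itself. The toy coefficients, the selected columns, and the number of classes are assumptions. Only Ward hierarchical clustering and K-Means are shown; a Gaussian Mixture Model would need an additional library and is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Toy coefficients: two well-separated groups of 10 particles, 3 coefficients each.
coeffs = np.vstack([rng.normal(0.0, 0.1, size=(10, 3)),
                    rng.normal(5.0, 0.1, size=(10, 3))])
selected = coeffs[:, [0, 1]]  # user-selected subset of coefficients

# Hierarchical clustering with Ward linkage, cut into 2 classes.
tree = linkage(selected, method="ward")
hac_labels = fcluster(tree, t=2, criterion="maxclust")  # labels start at 1

# K-Means with k-means++ initialisation; the seed makes the run repeatable.
centroids, km_labels = kmeans2(selected, 2, minit="++", seed=0)  # labels start at 0

print(hac_labels)
print(km_labels)
```

Class averages would then be formed by averaging the aligned particles that share a label.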