Principal Component Analysis Classification

In principal component analysis (PCA) classification the full set of particles are simplified into a new lower-dimensional representation by means of Eigen or Singular Value decomposition methods. Particles projected onto these most variable basis-vectors then can be clustered using a variety of methods.

Within subTOM particles are first compared using Constrained Cross-Correlation taking into account the missing wedge. The pairs used in comparison are pre-calculated with the function subtom_prepare_ccmatrix. To speed up calculation particles can be pre-aligned using the function subtom_parallel_prealign.

The comparisons are calculated in parallel batches with subtom_parallel_ccmatrix and the results are combined with subtom_join_ccmatrix. The Cross-Correlation matrix is then decomposed into a user-given number of basis vectors using either Eigenvalue decomposition with subtom_eigs or Singular Value decomposition with subtom_svds, which the basis vectors and their respective weights.

The particles that were compared against are then projected onto these vectors by first constructing a matrix of the aligned data with subtom_parallel_xmatrix_pca and then projected in parallel batches with subtom_parallel_eigenvolumes and joined with subtom_join_eigenvolumes. These volumes are then used to determine the low-rank approximation coefficients in volume space for clustering. A larger particle superset can be projected onto the volumes to speed up classification of large datasets. Coefficients are also calculated in parallel in batches with subtom_parallel_eigcoeffs_pca and joined with subtom_join_eigencoeffs_pca.

Finally using a user-selected subset of the determined coefficients, the data is clustered either by Hierarchical Ascendant Clustering using a Ward distance criterion, K-Means clustering, or a Gaussian Mixture model with the function subtom_cluster. This clustering is then used to generate the final class averages.