ciss_vae.utils.clustering.cluster_on_missing_prop
- cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)
Cluster samples based on their per-feature missingness proportions using KMeans or Leiden.
When n_clusters is None, performs Leiden clustering on a graph constructed from the missingness proportion matrix. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph with Jaccard-based or metric-based similarity; otherwise uses a standard kNN graph. Returns both the cluster labels and an optional silhouette score.
- Parameters:
prop_matrix (pandas.DataFrame or numpy.ndarray) – Matrix of missingness proportions, shape (n_samples, n_features). Each entry represents the fraction of missing values for a feature within each sample. Values must lie in [0, 1].
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering instead. Defaults to None.
seed (int or None, optional) – Random seed for KMeans initialization and Leiden reproducibility. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors for kNN/SNN graph construction used by Leiden. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual or weighted neighbor overlap. If False, uses standard kNN. Defaults to True.
snn_mutual (bool, optional) – If True, retains only mutual nearest neighbors when building the SNN graph. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter controlling cluster granularity in Leiden. Higher values produce more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
metric (str, optional) – Distance metric used for kNN graph construction and silhouette calculation. Recommended options are "euclidean" or "cosine". Defaults to "euclidean".
scale_features (bool, optional) – Whether to standardize features (zero mean, unit variance) prior to clustering. Recommended when feature scales differ widely. Defaults to False.
- Returns:
Tuple (labels, silhouette):
  - labels (numpy.ndarray): Cluster assignments of length n_samples.
  - silhouette (float or None): Silhouette score based on the same metric; None if undefined (e.g., single cluster).
- Return type:
tuple[numpy.ndarray, float or None]
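The prop_matrix input described above can be prepared in several ways. The following is a minimal sketch, assuming each feature is measured repeatedly (for example at several timepoints) and stored as separate columns with NaN marking missing values. The helper name missingness_proportions and the feature_groups argument are hypothetical and not part of ciss_vae.

```python
import numpy as np


def missingness_proportions(data, feature_groups):
    """Hypothetical helper: per-sample fraction of missing values per feature.

    data           -- array of shape (n_samples, n_columns), NaN = missing
    feature_groups -- list of column-index lists, one list per feature
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # For each feature, average the missing indicator over its columns.
    columns = [missing[:, cols].mean(axis=1) for cols in feature_groups]
    return np.column_stack(columns)


# Three samples, two features, each feature measured twice (assumed layout).
data = np.array([
    [1.0, np.nan, 3.0, 4.0],
    [np.nan, np.nan, 5.0, np.nan],
    [2.0, 2.5, 6.0, 7.0],
])
groups = [[0, 1], [2, 3]]

prop_matrix = missingness_proportions(data, groups)
print(prop_matrix)
```

The result has shape (n_samples, n_features) with every value in [0, 1], which matches the documented requirements for prop_matrix.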
Example:
>>> labels, silh = cluster_on_missing_prop(prop_matrix, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2, 3])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.421
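To illustrate what a shared-nearest-neighbor graph with Jaccard similarity looks like, here is a small NumPy-only sketch. It is not the library's implementation: the function name snn_jaccard_graph, the choice to include each point in its own neighbor set, and the brute-force Euclidean distance computation are all assumptions made for the example.

```python
import numpy as np


def snn_jaccard_graph(X, k=2, mutual=True):
    """Illustrative SNN graph (hypothetical helper, not the ciss_vae code).

    Returns a symmetric (n, n) weight matrix where the weight of an edge is
    the Jaccard overlap of the two endpoints' neighbor sets.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Brute-force pairwise Euclidean distances; exclude self-matches.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # k nearest neighbors per point; each set also contains the point itself
    # (an assumption, so that linked points always share at least one member).
    knn = np.argsort(dist, axis=1)[:, :k]
    neighbor_sets = [set(knn[i].tolist()) | {i} for i in range(n)]

    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            i_has_j = j in neighbor_sets[i]
            j_has_i = i in neighbor_sets[j]
            # mutual=True keeps an edge only if both points list each other.
            linked = (i_has_j and j_has_i) if mutual else (i_has_j or j_has_i)
            if not linked:
                continue
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            union = len(neighbor_sets[i] | neighbor_sets[j])
            weights[i, j] = weights[j, i] = shared / union
    return weights


# Two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
W = snn_jaccard_graph(X, k=2, mutual=True)
print(W)
```

In this toy input, edges form only within each group. A weighted adjacency matrix of this kind is what a community-detection method such as Leiden would then partition.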