ciss_vae.utils.clustering.cluster_on_missing_prop

cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)[source]

Cluster samples based on their per-feature missingness proportions using KMeans or Leiden.

When n_clusters is an integer, runs KMeans with that many clusters. When n_clusters is None, runs Leiden clustering on a graph built from the missingness proportion matrix: if use_snn=True, a shared-nearest-neighbor (SNN) graph with Jaccard-based or metric-based similarity; otherwise a standard kNN graph. Returns the cluster labels together with a silhouette score, which is None when the score is undefined.
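The SNN idea mentioned above can be illustrated with a small NumPy sketch. It is not the library's implementation: the helper names knn_indices and snn_jaccard are hypothetical, the neighbor sets exclude the point itself, and the mutual flag is assumed to mean "keep an edge only when each point is in the other's kNN list".

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (Euclidean) of each row, excluding itself."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

def snn_jaccard(X, k, mutual=True):
    """Dense SNN similarity matrix weighted by Jaccard overlap of kNN sets."""
    n = X.shape[0]
    nbrs = [set(row.tolist()) for row in knn_indices(X, k)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            i_lists_j = j in nbrs[i]
            j_lists_i = i in nbrs[j]
            # Assumed meaning of "mutual": both points must list each other.
            connected = (i_lists_j and j_lists_i) if mutual else (i_lists_j or j_lists_i)
            if not connected:
                continue
            shared = len(nbrs[i] & nbrs[j])
            total = len(nbrs[i] | nbrs[j])
            W[i, j] = W[j, i] = shared / total
    return W

# Toy missingness proportions: two well-separated groups of three samples.
prop = np.array([
    [0.00, 0.10], [0.05, 0.10], [0.00, 0.15],
    [0.90, 0.80], [0.95, 0.85], [0.90, 0.90],
])
W = snn_jaccard(prop, k=2, mutual=True)
```

In this toy case every within-group pair shares one neighbor out of three distinct ones, so those edges get weight 1/3, and no edge links the two groups. A community-detection step such as Leiden would then operate on a graph like W.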

Parameters:
  • prop_matrix (pandas.DataFrame or numpy.ndarray) – Matrix of missingness proportions, shape (n_samples, n_features). Each entry is the fraction of values missing for one feature within one sample. Values must lie in [0, 1].

  • n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering instead. Defaults to None.

  • seed (int or None, optional) – Random seed for KMeans initialization and Leiden reproducibility. Defaults to None.

  • k_neighbors (int, optional) – Number of nearest neighbors for kNN/SNN graph construction used by Leiden. Defaults to 15.

  • use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual or weighted neighbor overlap. If False, uses standard kNN. Defaults to True.

  • snn_mutual (bool, optional) – If True, retains only mutual nearest neighbors when building the SNN graph. Defaults to True.

  • leiden_resolution (float, optional) – Resolution parameter controlling cluster granularity in Leiden. Higher values produce more clusters. Defaults to 0.5.

  • leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".

  • metric (str, optional) – Distance metric used for kNN graph construction and silhouette calculation. Recommended options are "euclidean" or "cosine". Defaults to "euclidean".

  • scale_features (bool, optional) – Whether to standardize features (zero mean, unit variance) prior to clustering. Recommended when feature scales differ widely. Defaults to False.
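As a rough illustration of what a prop_matrix input might look like, the sketch below derives one from raw data with NaN marking missing values. It assumes each feature is observed several times per sample (for example, repeated measurements), so that a per-sample fraction is meaningful. The helper missing_proportions and the group names "biomarker" and "score" are hypothetical and not part of ciss_vae.

```python
import numpy as np

def missing_proportions(data, groups):
    """Fraction of NaN entries per sample within each named column group.

    data   : array-like, shape (n_samples, n_columns), NaN marks a missing value
    groups : dict mapping a feature name to the column indices that belong to it
    """
    mask = np.isnan(np.asarray(data, dtype=float))
    # One output column per feature group: mean of the missing mask over its columns.
    columns = [mask[:, idx].mean(axis=1) for idx in groups.values()]
    return np.column_stack(columns)

raw = np.array([
    [1.0,    np.nan, 3.0,    0.2,    0.4],
    [np.nan, np.nan, np.nan, 0.1,    np.nan],
    [2.0,    2.5,    3.5,    np.nan, np.nan],
])
groups = {"biomarker": [0, 1, 2], "score": [3, 4]}  # hypothetical grouping

prop_matrix = missing_proportions(raw, groups)

# Check the documented constraint before clustering.
assert prop_matrix.min() >= 0.0 and prop_matrix.max() <= 1.0
```

The resulting array has shape (n_samples, n_features), here (3, 2), with every value in [0, 1] as the prop_matrix parameter requires.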

Returns:

Tuple (labels, silhouette):

  • labels (numpy.ndarray) – Cluster assignments of length n_samples.

  • silhouette (float or None) – Silhouette score based on the same metric; None if undefined (e.g., single cluster).

Return type:

tuple[numpy.ndarray, float or None]

Example:

>>> import numpy as np
>>> from ciss_vae.utils.clustering import cluster_on_missing_prop
>>> labels, silh = cluster_on_missing_prop(prop_matrix, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2, 3])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.421
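Because silhouette may be None, formatting it directly with {silh:.3f} raises a TypeError when only one cluster is found. A minimal sketch of a None-safe report follows; the helper summarize_clustering is hypothetical and works on any (labels, silhouette) pair shaped like this function's return value.

```python
import numpy as np

def summarize_clustering(labels, silhouette):
    """Return a one-line summary that tolerates an undefined silhouette."""
    labels = np.asarray(labels)
    # Count how many samples fall in each cluster label.
    sizes = {int(c): int((labels == c).sum()) for c in np.unique(labels)}
    silh_text = "undefined" if silhouette is None else f"{silhouette:.3f}"
    return f"{len(sizes)} cluster(s), sizes={sizes}, silhouette={silh_text}"

two_clusters = summarize_clustering([0, 0, 1, 1, 1], 0.4213)
one_cluster = summarize_clustering([0, 0, 0], None)
print(two_clusters)
print(one_cluster)
```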